This report performs SVD-based alignment analysis between router vectors and expert weight matrices.
Plot Explanations
The following section explains what each plot type shows and how it is computed. All plots use layer numbers (e.g., L5, L10) in their legends, without timestamps.
Comparison Plots
This figure contains four subplots comparing multiple analysis runs:
- Alignment vs k: Shows the mean alignment score (projection energy) as a function of k (number of top singular vectors used).
- Formula: align(k) = Σᵢ₌₁ᵏ (vᵢᵀ · r)², where vᵢ are the top-k right singular vectors from SVD of expert weight matrix, and r is the normalized router vector.
- Interpretation: The alignment score is the sum of squared projections of the router vector onto the top-k right singular vectors of the expert weight matrix. For k=1, this equals cos²(θ) between the router vector and the top singular vector. For k>1, it sums the squared projections across multiple singular vectors, measuring how much of the router vector's energy lies in the top-k dimensional subspace of the expert.
- Computation: (1) Perform SVD on each expert weight matrix W to get right singular vectors V (columns are singular vectors), (2) Normalize the router vector to unit length, (3) Project the normalized router vector onto the top-k columns of V: proj = V[:, :k]ᵀ @ router_vec, (4) Sum the squared projections: align = Σ(proj²).
- Range: [0, 1]. Value of 1 means the router vector lies entirely in the top-k subspace. Value of 0 means it's orthogonal to that subspace.
- Higher values indicate: Stronger alignment - the router vector is well-aligned with the principal directions of the expert weight matrix.
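The four computation steps above can be sketched in numpy; the function name and toy matrix here are illustrative, not from the analysis code:

```python
import numpy as np

def alignment_score(W, router_vec, k):
    """Projection energy of a router vector onto the top-k right
    singular vectors of an expert weight matrix W (steps 1-4 above)."""
    _, _, Vt = np.linalg.svd(W, full_matrices=False)  # rows of Vt = right singular vectors
    r = router_vec / np.linalg.norm(router_vec)       # step 2: unit-normalize router
    proj = Vt[:k] @ r                                 # step 3: top-k projections
    return float(np.sum(proj ** 2))                   # step 4: projection energy

# Toy check: a router equal to W's top singular direction scores 1.0 at k=1
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))
top_dir = np.linalg.svd(W, full_matrices=False)[2][0]
score = alignment_score(W, top_dir, k=1)  # -> 1.0 up to float error
```

Because the top-k subspaces are nested, the score is non-decreasing in k and reaches 1.0 once k spans the full row space.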
- Z-score vs k: Shows the z-score of alignment compared to shuffled baselines.
- Formula: z(k) = (align(k) - shuffle_mean(k)) / shuffle_std(k)
- Interpretation: Measures how many standard deviations the actual alignment is above the shuffle baseline. This is a normalized measure of statistical significance.
- Computation: (1) Compute shuffle_mean and shuffle_std by shuffling router-expert assignments many times (typically 200) and computing alignment for each shuffle, (2) Calculate z-score = (actual_alignment - shuffle_mean) / shuffle_std.
- Interpretation thresholds: z > 2 indicates ~95% confidence, z > 3 indicates ~99.7% confidence that alignment exceeds chance. Values near 0 indicate alignment is consistent with random assignments.
- Higher values indicate: More statistically significant alignment above the shuffle baseline.
- Effect over Random vs k: Shows the effect size over theoretical random baseline.
- Formula: effect_over_random(k) = align(k) - (k / d_model)
- Interpretation: Measures how much the actual alignment exceeds the theoretical expectation if the router vector were randomly oriented in d_model-dimensional space. The baseline k/d_model is the expected projection energy onto a random k-dimensional subspace.
- Computation: (1) Calculate theoretical baseline: random_expect = k / d_model (where d_model is the model dimension, typically 4096), (2) Subtract from actual alignment: effect = align - random_expect.
- Baseline explanation: If a unit vector is uniformly random in d_model dimensions, each squared coordinate has expectation 1/d_model by symmetry, so the expected projection energy onto any fixed k-dimensional subspace is k/d_model.
- Positive values indicate: Alignment exceeds theoretical random expectation. Negative values indicate alignment is below even random expectation (rare but possible).
- Limitation: This baseline assumes completely random orientation and doesn't account for the actual structure of router and expert vectors.
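The k/d_model baseline is easy to verify with a quick Monte Carlo check (toy dimensions below, not the real d_model = 4096):

```python
import numpy as np

# Expected projection energy of a random unit vector onto a fixed
# k-dimensional subspace should be k/d.
rng = np.random.default_rng(0)
d, k, trials = 64, 8, 2000
energies = np.empty(trials)
for t in range(trials):
    r = rng.standard_normal(d)
    r /= np.linalg.norm(r)                 # random unit vector in R^d
    energies[t] = np.sum(r[:k] ** 2)       # first k coordinates span a fixed subspace
mc_estimate = float(energies.mean())       # should be close to k/d = 0.125
```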
- Delta vs Shuffle vs k: Shows the difference between actual alignment and empirical shuffle baseline.
- Formula: delta(k) = align(k) - shuffle_mean(k)
- Interpretation: Measures the raw difference between actual alignment and the empirical mean from shuffled assignments. This preserves the actual structure of router and expert vectors but randomizes which router is assigned to which expert.
- Computation: (1) Perform many shuffles (typically 200): randomly permute which router vector is assigned to which expert, (2) For each shuffle, compute alignment using the shuffled assignments, (3) Calculate shuffle_mean = mean of all shuffle alignments, (4) Calculate delta = actual_alignment - shuffle_mean.
- Why it's more realistic: Unlike the theoretical baseline, this preserves the actual structure and magnitude of router and expert vectors. It only randomizes the assignment, making it a more appropriate null hypothesis for testing whether specific router-expert pairs are aligned.
- Positive values indicate: Actual alignment exceeds the empirical shuffle baseline, suggesting meaningful router-expert alignment beyond random assignment.
- Comparison to Effect over Random: Delta vs Shuffle is typically more conservative (smaller values) because shuffle_mean accounts for the actual vector structures, whereas k/d_model assumes completely random vectors.
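The shuffle baseline, delta, and z-score panels all derive from the same permutation procedure, sketched below on a toy layer (the alignment function and data are illustrative; routers are set to each expert's top singular direction so the true alignment is maximal at k=1):

```python
import numpy as np

def align(W, r, k):
    Vt = np.linalg.svd(W, full_matrices=False)[2]
    r = r / np.linalg.norm(r)
    return float(np.sum((Vt[:k] @ r) ** 2))

def shuffle_stats(routers, experts, k, n_shuffles=200, seed=0):
    """Randomly permute router->expert assignments and record the mean
    alignment per shuffle; returns (shuffle_mean, shuffle_std)."""
    rng = np.random.default_rng(seed)
    n = len(experts)
    vals = [np.mean([align(experts[j], routers[i], k)
                     for i, j in enumerate(rng.permutation(n))])
            for _ in range(n_shuffles)]
    return float(np.mean(vals)), float(np.std(vals, ddof=1))

rng = np.random.default_rng(1)
experts = [rng.standard_normal((16, 8)) for _ in range(4)]
routers = [np.linalg.svd(W, full_matrices=False)[2][0] for W in experts]

k = 1
actual = float(np.mean([align(W, r, k) for W, r in zip(experts, routers)]))
mu, sd = shuffle_stats(routers, experts, k, n_shuffles=100)
delta = actual - mu         # raw effect (Delta vs Shuffle panel)
z = (actual - mu) / sd      # normalized effect (Z-score panel)
```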
Cos²(θ) Expert Comparison
This figure compares cos²(θ) values across experts and layers (for k=1 only):
- Left plot: Shows cos²(θ) per expert for k=1 (using only the top singular vector). Each line represents a different layer.
- Formula: cos²(θ) = (rᵀ · v₁)², where r is the normalized router vector and v₁ is the top right singular vector (first column of V from SVD).
- Interpretation: Measures the squared cosine of the angle between the router vector and the principal direction (top singular vector) of the expert weight matrix. This is a direct correlation measure indicating how well-aligned the router is with the expert's primary direction.
- Computation: (1) Perform SVD on expert weight matrix to get V (right singular vectors), (2) Extract v₁ = V[:, 0] (top singular vector), (3) Normalize router vector to unit length, (4) Compute cos²(θ) = (router_vecᵀ · v₁)².
- Relationship to alignment: For k=1, cos²(θ) = align(k=1). For k>1, align(k) = Σᵢ₌₁ᵏ cos²(θᵢ) where θᵢ is the angle with the i-th singular vector.
- Range: [0, 1]. Value of 1 means router is perfectly aligned with top singular vector. Value of 0 means router is orthogonal to it.
- Higher values indicate: Stronger alignment between router and expert's principal direction.
- Right plot: Shows mean cos²(θ) across all experts for each layer at k=1. Bar heights represent the average alignment strength per layer.
- Formula: mean_cos²(θ) = (1/n_experts) · Σᵢ cos²(θᵢ), where the sum is over all experts in the layer.
- Interpretation: Average alignment strength across all experts in a layer. Provides a layer-level summary of router-expert alignment.
- Computation: (1) Compute cos²(θ) for each expert at k=1, (2) Average across all experts in the layer.
- Use case: Compare alignment strength across different layers. Higher values indicate stronger overall alignment in that layer.
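A minimal sketch of the k=1 metric and its layer-level summary, with made-up toy data:

```python
import numpy as np

def cos2_top(W, r):
    """cos²(θ) with the expert's top right singular vector (the k=1 case)."""
    v1 = np.linalg.svd(W, full_matrices=False)[2][0]
    r = r / np.linalg.norm(r)
    return float((r @ v1) ** 2)

# Right plot's bar height: mean cos² across one layer's experts
rng = np.random.default_rng(2)
experts = [rng.standard_normal((16, 8)) for _ in range(4)]
routers = [rng.standard_normal(8) for _ in range(4)]
layer_mean = float(np.mean([cos2_top(W, r) for W, r in zip(experts, routers)]))
```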
Shuffle Statistics
This figure shows statistics from shuffled baseline comparisons:
- Shuffle Mean vs k: Mean alignment value from shuffled router-expert assignments as a function of k.
- Formula: shuffle_mean(k) = (1/n_shuffles) · Σᵢ align_shuffled_i(k), where align_shuffled_i is the alignment computed with the i-th shuffled assignment.
- Interpretation: The expected alignment under the null hypothesis that router-expert assignments are random. This is the empirical baseline used for statistical comparison.
- Computation: (1) For each shuffle iteration (typically 200): randomly permute which router vector is assigned to which expert, (2) Compute alignment for each shuffled assignment using the same projection energy formula, (3) Average all shuffle alignments: shuffle_mean = mean(align_shuffled).
- Why it matters: This provides the null distribution mean. Actual alignment significantly above this suggests meaningful structure beyond random assignment.
- Typical behavior: Usually increases with k (more dimensions = higher projection energy), but typically lower than actual alignment when there's real structure.
- Shuffle Std vs k: Standard deviation of alignment values from shuffled assignments.
- Formula: shuffle_std(k) = std(align_shuffled(k)) = √[(1/(n-1)) · Σᵢ (align_shuffled_i - shuffle_mean)²]
- Interpretation: Measures the variability in alignment when router-expert assignments are randomized. Larger values indicate more uncertainty in the null distribution.
- Computation: (1) Compute alignments for all shuffle iterations, (2) Calculate standard deviation across all shuffle alignments.
- Use in z-score: Used as the denominator in z-score calculation: z = (align - shuffle_mean) / shuffle_std. Larger std means smaller z-scores for the same delta, making significance harder to achieve.
- Typical behavior: Usually increases with k, as higher-dimensional projections have more variability.
- Log Shuffle Std vs k: Logarithm of shuffle standard deviation.
- Formula: log_shuffle_std(k) = log(shuffle_std(k) + ε), where ε is a small constant (typically 1e-10) to avoid log(0).
- Interpretation: Logarithmic scale makes it easier to visualize exponential or power-law relationships in the standard deviation.
- Computation: (1) Compute shuffle_std as above, (2) Apply natural logarithm: log(std + ε).
- Why use log scale: If std grows exponentially with k, the log plot will show a linear relationship, making patterns easier to identify.
- Use case: Helps identify whether variability grows exponentially, linearly, or sub-linearly with k.
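A toy illustration of why the log scale helps: if shuffle_std grew exponentially in k, log(std) would be exactly linear in k, and the growth rate could be read off as a slope (the values below are made up, not measured):

```python
import numpy as np

ks = np.array([1.0, 4.0, 16.0, 64.0])
shuffle_std = 0.001 * np.exp(0.05 * ks)   # hypothetical exponential growth
eps = 1e-10                               # guards against log(0), as in the report
log_std = np.log(shuffle_std + eps)
slope = np.polyfit(ks, log_std, 1)[0]     # recovers the growth rate, ~0.05
```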
Z-score Decomposition
This figure breaks down the z-score calculation into its components:
- Delta vs k: The numerator of z-score.
- Formula: Δ(k) = align(k) - shuffle_mean(k)
- Interpretation: The raw difference between actual alignment and the shuffle baseline. This is the "effect size" before normalization.
- Computation: (1) Compute actual alignment for each k, (2) Compute shuffle_mean for each k (from shuffle statistics), (3) Calculate delta = align - shuffle_mean for each k.
- Units: Dimensionless, the same as alignment. Since both align(k) and shuffle_mean(k) lie in [0, 1], delta lies in [-1, 1] and can be negative or positive.
- Positive values indicate: Actual alignment exceeds shuffle baseline. Negative values indicate actual alignment is below shuffle baseline (rare but possible).
- Relationship to z-score: Delta is the numerator. Larger delta (with same std) leads to larger z-score.
- Shuffle Std vs k: The denominator of z-score.
- Formula: σ_shuffle(k) = std(align_shuffled(k))
- Interpretation: The variability in shuffled alignments. This is the same metric shown in Shuffle Statistics plot, but displayed here to show its role in z-score normalization.
- Computation: Standard deviation across all shuffle iterations for each k value. Same as described in Shuffle Statistics.
- Role in z-score: Acts as the normalization factor. Larger std means the same delta produces a smaller z-score, making it harder to achieve statistical significance.
- Why it matters: Understanding std helps interpret z-scores. A large delta with large std might have a moderate z-score, while a smaller delta with small std might have a large z-score.
- Z-score vs k: The final z-score.
- Formula: z(k) = Δ(k) / σ_shuffle(k) = (align(k) - shuffle_mean(k)) / shuffle_std(k)
- Interpretation: The number of standard deviations the actual alignment is above (or below) the shuffle baseline. This is a normalized measure of statistical significance.
- Computation: (1) Compute delta for each k, (2) Compute shuffle_std for each k, (3) Calculate z = delta / shuffle_std for each k.
- Statistical interpretation: Under the null hypothesis (random assignments), z follows approximately a standard normal distribution. z > 2 indicates ~95% confidence (p < 0.05), z > 3 indicates ~99.7% confidence (p < 0.003) that alignment exceeds chance.
- Advantages over delta: Normalized measure that accounts for variability. A delta of 0.1 might be significant if std=0.02 (z=5) but not if std=0.1 (z=1).
- Higher values indicate: More statistically significant alignment above the shuffle baseline.
Distribution Comparison
This plot shows the probability distribution of shuffled alignments (projection energies) compared to the true alignment value:
- Solid curves: Approximate normal distribution of alignment values (projection energies) from shuffled router-expert assignments.
- Formula: P(align) ≈ N(μ_shuffle, σ²_shuffle), where μ_shuffle = shuffle_mean and σ_shuffle = shuffle_std.
- Interpretation: The probability distribution of what alignment values we would expect by chance when router vectors are randomly assigned to experts. This is the null distribution for statistical testing.
- Computation: (1) Compute shuffle_mean and shuffle_std from shuffle experiments, (2) Approximate the distribution as a normal distribution: N(shuffle_mean, shuffle_std²), (3) Plot the probability density function using scipy.stats.norm.pdf(x, shuffle_mean, shuffle_std).
- Why normal distribution: Each shuffle's alignment is itself an average over many experts, so by the Central Limit Theorem it is approximately normally distributed; with ~200 shuffles, shuffle_mean and shuffle_std estimate this null distribution well.
- What it shows: The range and likelihood of alignment values under the null hypothesis. The peak is at shuffle_mean, and the width is determined by shuffle_std.
- Dashed vertical lines: The actual alignment value (projection energy) for each run.
- Formula: true_align(k) = (1/n_experts) · Σᵢ align_i(k), where align_i is the alignment for expert i at k.
- Interpretation: The observed mean alignment across all experts for the given k value. This is what we're testing against the null distribution.
- Computation: (1) Compute alignment for each expert at the given k value, (2) Average across all experts: true_align = mean(align_expert).
- Position relative to distribution: If the line is far to the right of the distribution peak (shuffle_mean), it indicates strong alignment above chance. The distance from the peak, measured in standard deviations, corresponds to the z-score.
- Statistical interpretation: If the line falls in the right tail of the distribution (beyond ~2σ), it suggests the alignment is statistically significant (p < 0.05).
- Multiple runs: Each run gets its own dashed line, allowing comparison of alignment strength across different layers or configurations.
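One curve-and-line pair can be sketched as follows; the numbers are illustrative, and the closed-form Gaussian density below is equivalent to the scipy.stats.norm.pdf call the report describes:

```python
import numpy as np

shuffle_mean, shuffle_std, true_align = 0.12, 0.02, 0.25  # illustrative values

# Null density curve: N(shuffle_mean, shuffle_std^2)
x = np.linspace(shuffle_mean - 4 * shuffle_std, true_align + 0.02, 400)
pdf = (np.exp(-0.5 * ((x - shuffle_mean) / shuffle_std) ** 2)
       / (shuffle_std * np.sqrt(2 * np.pi)))

# The dashed line's distance from the peak, in standard deviations,
# is exactly the z-score.
z = (true_align - shuffle_mean) / shuffle_std  # 6.5 sigma for these values
```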
Per-Expert Breakdown
This figure provides detailed expert-level analysis:
- Alignment Heatmap (top left): Shows alignment values (projection energies) for each expert (rows) across different k values (columns).
- Formula: heatmap[expert, k] = mean(align_expert,k), averaged across runs when more than one is present.
- Interpretation: Visual representation of how alignment strength varies across experts and k values. Each cell shows the mean alignment for a specific expert-k combination.
- Computation: (1) Group data by expert and k, (2) Average alignment values within each group, (3) Create pivot table: pivot = df.pivot_table(values='align', index='expert', columns='k', aggfunc='mean'), (4) Display as heatmap with color intensity proportional to alignment value.
- Color scheme: Warmer colors (yellow/green) indicate stronger alignment, cooler colors (blue/purple) indicate weaker alignment. Uses 'viridis' colormap.
- What to look for: Patterns across experts (rows) show which experts have consistently high/low alignment. Patterns across k (columns) show how alignment changes with dimensionality.
- Use case: Identify experts with particularly strong or weak alignment, and see how alignment scales with k for each expert.
- Delta Heatmap (top right): Shows delta (alignment - shuffle_mean) for each expert across k values.
- Formula: heatmap[expert, k] = mean(align_expert,k - shuffle_mean_k), averaged across runs if present.
- Interpretation: Shows how much each expert's alignment exceeds (or falls below) the shuffle baseline at each k value.
- Computation: (1) Compute delta for each expert-k combination: delta = align - shuffle_mean, (2) Create pivot table: pivot = df.pivot_table(values='delta_vs_shuffle', index='expert', columns='k', aggfunc='mean'), (3) Display as heatmap with colormap centered at zero.
- Color scheme: Red indicates positive delta (above shuffle baseline), blue indicates negative delta (below shuffle baseline). Uses 'RdBu_r' (Red-Blue reversed) colormap, centered at zero using vmin=-max_abs, vmax=max_abs.
- What to look for: Experts with consistently red cells have strong alignment above baseline. Experts with blue cells have alignment below baseline (rare but possible).
- Advantage over alignment heatmap: Normalized by shuffle baseline, making it easier to see which experts truly exceed chance expectations.
- Alignment vs Expert (bottom left): Scatter plot showing alignment values (projection energies) for each expert at a fixed k (typically k=128).
- Formula: For each expert i: align_i(k_fixed), where k_fixed is typically 128 or the median k value if 128 is not available.
- Interpretation: Shows the distribution of alignment strengths across experts at a representative k value. Each point represents one expert's alignment.
- Computation: (1) Select a representative k value (prefer k=128, fallback to median k), (2) Filter data: k_data = df[df['k'] == k_fixed], (3) Extract alignment values: align_vals = k_data['align'].values, expert_vals = k_data['expert'].values, (4) Plot as scatter: scatter(expert_vals, align_vals).
- Why scatter plot: Shows individual expert values rather than averages, revealing variability and outliers.
- What to look for: Experts with particularly high or low alignment values. Clustering of points suggests similar alignment strengths across experts.
- Multiple runs: If comparing multiple runs, each run gets a different color/marker, allowing comparison of alignment patterns across layers or configurations.
- Delta vs Expert (bottom right): Scatter plot showing delta values for each expert at a fixed k.
- Formula: For each expert i: delta_i(k_fixed) = align_i(k_fixed) - shuffle_mean(k_fixed).
- Interpretation: Shows how much each expert's alignment exceeds (or falls below) the shuffle baseline at a representative k value.
- Computation: (1) Use the same k_fixed as in Alignment vs Expert plot, (2) Filter data: k_data = df[df['k'] == k_fixed], (3) Extract delta values: delta_vals = k_data['delta_vs_shuffle'].values, expert_vals = k_data['expert'].values, (4) Plot as scatter: scatter(expert_vals, delta_vals), (5) Add horizontal line at y=0 for reference.
- Reference line: The horizontal dashed line at y=0 separates experts above baseline (positive delta) from those below baseline (negative delta).
- What to look for: Experts with delta significantly above zero have strong alignment. Most experts should have positive delta if there's meaningful structure. Negative delta is rare but indicates alignment below even random assignment.
- Advantage over Alignment vs Expert: Normalized by shuffle baseline, making it easier to identify experts with statistically meaningful alignment.
- Multiple runs: If comparing multiple runs, each run gets different markers, showing how delta patterns vary across layers or configurations.
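The pivot step shared by both heatmaps can be sketched with a made-up long-format results frame (one row per expert-k pair, as the report describes; column names follow the snippets above):

```python
import numpy as np
import pandas as pd

# Illustrative long-format results: one row per (expert, k) combination
df = pd.DataFrame({
    "expert": [0, 0, 1, 1],
    "k":      [1, 128, 1, 128],
    "align":  [0.10, 0.60, 0.20, 0.80],
})

# Rows = experts, columns = k values, cells = mean alignment
pivot = df.pivot_table(values="align", index="expert", columns="k", aggfunc="mean")
```

The delta heatmap uses the same pivot with `values='delta_vs_shuffle'` and a zero-centered diverging colormap.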
Complete Analysis Visualization
This comprehensive figure contains 12 subplots showing all key metrics for a single analysis run:
- Row 1: Alignment, Z-score, and Effect over Random vs k (log scale)
- Row 2: Delta vs Shuffle, Shuffle Mean, and Shuffle Std vs k
- Row 3: Heatmaps showing Alignment, Delta, and Z-score across experts (rows) and k values (columns)
- Row 4: Cos²(θ) per expert (k=1), Alignment per expert, and Z-score per expert at a representative k value
All metrics are computed as described in the individual plot explanations above.
Ambiguity Score Analysis
Per-Layer Ambiguity Score Analysis: This analysis combines multiple metrics to identify layers with potential routing instability and load imbalance.
- Purpose: The Ambiguity Score quantifies how "ambiguous" a layer is in terms of router-expert alignment. Higher scores indicate layers where routing decisions may be less clear, potentially leading to instability or load imbalance.
- Method:
- Step 1 - Select k*: For each layer, select a fixed k* value. Options include:
- max_margin: Use the k value where alignment margin is maximal (default, recommended)
- fixed_128, fixed_256, fixed_512: Use a fixed k value across all layers
- Step 2 - Extract Metrics at k*: For each layer, extract:
- Argmax Accuracy: Fraction of routers for which the correct expert has the maximum alignment
- Alignment Margin: Mean difference between correct expert's alignment and next-best expert's alignment
- Z-Score: Mean z-score versus shuffled router-expert pairings
- Step 3 - Normalize Across Layers: Normalize margins and z-scores to [0, 1] range across all layers for fair comparison.
- Step 4 - Compute Ambiguity Score: Weighted combination:
- Formula: AmbiguityScore = α·(1 - ArgmaxAccuracy) + β·(1 - NormalizedMargin) + γ·(1 - NormalizedZScore)
- Default weights: α = 0.4, β = 0.3, γ = 0.3 (sum to 1.0)
- Interpretation: Each component measures a different aspect of ambiguity:
- (1 - ArgmaxAccuracy): How often routers fail to identify the correct expert (higher = more ambiguous)
- (1 - NormalizedMargin): How small the separation is between correct and incorrect experts (higher = more ambiguous)
- (1 - NormalizedZScore): How weak the statistical significance is (higher = more ambiguous)
- Step 5 - Rank Layers: Sort layers by ambiguity score (descending). Highest scores indicate most ambiguous layers.
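Steps 3 and 4 can be sketched as follows, assuming the per-layer metrics at k* have already been extracted (function name and toy numbers are illustrative):

```python
import numpy as np

def ambiguity_scores(argmax_acc, margins, zscores, alpha=0.4, beta=0.3, gamma=0.3):
    """Step 3: min-max normalize margins and z-scores across layers;
    Step 4: weighted combination with the stated default weights."""
    def norm01(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    m = norm01(margins)
    z = norm01(zscores)
    a = np.asarray(argmax_acc, dtype=float)
    return alpha * (1 - a) + beta * (1 - m) + gamma * (1 - z)

# Toy example: layer 0 is clean, layer 2 is ambiguous
scores = ambiguity_scores(argmax_acc=[1.0, 0.8, 0.3],
                          margins=[0.4, 0.1, -0.05],
                          zscores=[12.0, 5.0, 0.5])
ranking = np.argsort(scores)[::-1]   # step 5: most ambiguous layer first
```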
- Visualization:
- Plot 1 - Ambiguity Score by Layer: Bar chart showing ambiguity score for each layer. Higher bars indicate more ambiguous layers.
- Plot 2 - Components: Stacked or grouped bar chart showing the three components (1 - ArgmaxAccuracy, 1 - NormalizedMargin, 1 - NormalizedZScore) for each layer.
- Plot 3 - Argmax Accuracy by Layer: Line plot showing argmax accuracy across layers. Lower values indicate more ambiguity.
- Plot 4 - Alignment Margin by Layer: Line plot showing alignment margin across layers. Lower (or negative) values indicate more ambiguity.
- Interpretation:
- High Ambiguity Score (> 0.7): Layer has weak unique identification, small margins, and/or low z-scores. These layers are at high risk of routing instability and load imbalance.
- Medium Ambiguity Score (0.4-0.7): Layer has moderate ambiguity. Some routing decisions may be unclear, but not critically unstable.
- Low Ambiguity Score (< 0.4): Layer has strong unique identification, large margins, and high z-scores. These layers have clear routing decisions and are less likely to have stability issues.
- Ranking: Layers ranked highest (top of the list) are the most ambiguous and should be prioritized for investigation or intervention.
- Use Cases:
- Identify Problem Layers: Quickly identify which layers have the most ambiguous routing, helping prioritize debugging or optimization efforts.
- Compare Architectures: Compare ambiguity scores across different model architectures or training configurations.
- Monitor Training: Track ambiguity scores during training to detect when layers become more ambiguous (potential sign of training issues).
- Load Balancing: Layers with high ambiguity scores may benefit from load balancing interventions or routing regularization.
- Limitations:
- The score depends on the choice of k* and weights (α, β, γ). Different choices may yield different rankings.
- Normalization across layers assumes all layers should be compared on the same scale, which may not always be appropriate.
- High ambiguity doesn't necessarily mean the layer is "bad" - it may be intentionally ambiguous for certain tasks.
Inter-Expert Orthogonality
This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.
- Method:
- Step 1 - Extract Principal Directions: For each expert in the layer, extract its top singular vector (v₁ = V[:, 0]) from the SVD decomposition. This vector represents the expert's "main direction" or principal component.
- Step 2 - Compute Similarity Matrix: Calculate the cosine similarity between every pair of experts' top singular vectors. The result is a [n_experts, n_experts] matrix where entry (i, j) is the cosine similarity between expert i and expert j's principal directions.
- Formula: similarity[i, j] = (v₁ᵢ · v₁ⱼ) / (||v₁ᵢ|| · ||v₁ⱼ||) = v₁ᵢ · v₁ⱼ (since vectors are normalized), where v₁ᵢ is the top singular vector of expert i.
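The two steps above can be sketched directly; since the top singular vectors returned by SVD are already unit length, the similarity matrix reduces to a Gram matrix (toy data below is illustrative):

```python
import numpy as np

def expert_similarity(expert_weights):
    """Cosine similarity between every pair of experts' top right
    singular vectors within one layer."""
    tops = np.stack([np.linalg.svd(W, full_matrices=False)[2][0]
                     for W in expert_weights])   # [n_experts, d_model], unit rows
    return tops @ tops.T                         # dot products = cosine similarities

rng = np.random.default_rng(3)
sim = expert_similarity([rng.standard_normal((16, 8)) for _ in range(4)])
off_diag = sim[~np.eye(4, dtype=bool)]
mean_off = float(off_diag.mean())   # summary statistic: near 0 for diverse experts
```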
- Visualization:
- Heatmap: The similarity matrix is displayed as a heatmap with expert indices on both axes.
- Colormap: Uses 'coolwarm' diverging colormap showing signed cosine similarity: blue (-1 = opposite directions) → white (0 = orthogonal) → red (+1 = same direction). The diagonal always shows 1.0 (experts are perfectly similar to themselves).
- Grid: Grid lines separate cells for better readability.
- Annotations: For small matrices (≤16 experts), similarity values are displayed as text annotations on each cell.
- Interpretation:
- Diagonal pattern (good): If the matrix shows high values (near 1.0) only on the diagonal and low values (near 0.0) off-diagonal, experts are orthogonal and diverse. This is the desired behavior - each expert specializes in a different direction.
- High off-diagonal values (bad): If off-diagonal entries are high (e.g., > 0.5), it indicates that multiple experts have similar principal directions. This suggests "Expert Collapse" - experts are redundant and not utilizing their full capacity.
- Block patterns (partial collapse): If there are blocks of high similarity (e.g., experts 0-3 are similar to each other, experts 4-7 are similar to each other), it indicates partial collapse where groups of experts are redundant.
- Mean off-diagonal similarity: A useful summary statistic. Values near 0 indicate good orthogonality. Values > 0.3 suggest significant redundancy. Values > 0.7 indicate severe expert collapse.
- Statistics Reported:
- Mean off-diagonal similarity: Average cosine similarity between different experts (excludes diagonal). Lower is better.
- Mean absolute off-diagonal similarity: Average absolute value of off-diagonal similarities (ignores sign). Lower is better.
- Max off-diagonal similarity: Maximum similarity between any two different experts. If this is high (> 0.7), at least two experts are very similar.
- Min off-diagonal similarity: Minimum similarity (can be negative if experts point in opposite directions).
- Std off-diagonal similarity: Standard deviation of off-diagonal similarities. Higher values indicate more variability in expert relationships.
- Why This Matters:
- Expert Diversity: MoE models are designed to have diverse experts that specialize in different aspects of the input. If experts collapse, the model is not utilizing its full capacity.
- Efficiency: Redundant experts waste model parameters and computation. Orthogonal experts maximize the model's representational capacity.
- Training Health: Expert collapse can indicate training issues (e.g., insufficient load balancing, poor routing, or optimization problems).
Unique Identification Analysis
This analysis tests whether router-expert alignment is not just non-random, but actually uniquely identifying. It goes beyond shuffled-baseline tests to assess whether routing vectors encode expert identity in a separable and discriminative manner.
- Argmax Accuracy vs k: Fraction of router vectors where the correct expert achieves the maximum alignment among all experts.
- Formula: For each router vector r_i (assigned to expert i), compute alignment with ALL experts: align(r_i, Expert_j) for all j. Argmax accuracy = (1/n) · Σᵢ [argmax_j align(r_i, Expert_j) == i], where [·] is 1 if true, 0 otherwise.
- Interpretation: Measures whether the correct router-expert pairing can be uniquely identified from the alignment matrix. Value of 1.0 means perfect identification (every router's correct expert has the highest alignment). Value of 1/n_experts (e.g., 0.125 for 8 experts) means random guessing.
- Computation: (1) For each router vector r_i, compute alignment with all experts' principal subspaces, forming a row of the alignment matrix, (2) Find which expert has maximum alignment: argmax_j align(r_i, Expert_j), (3) Check if argmax equals the correct expert i, (4) Average across all routers to get accuracy.
- Range: [1/n_experts, 1.0]. Value of 1.0 indicates perfect unique identification. Value near 1/n_experts indicates alignment is not discriminative enough to identify experts uniquely.
- Comparison to shuffled baseline: While shuffled baseline tests whether alignment is above chance, argmax accuracy tests whether alignment is strong enough to uniquely identify the correct expert from all possible experts.
- Alignment Margin vs k: Mean difference between correct expert's alignment and the next-best expert's alignment.
- Formula: For each router vector r_i: margin_i = align(r_i, Expert_i) - max_{j≠i} align(r_i, Expert_j). Mean margin = (1/n) · Σᵢ margin_i.
- Interpretation: Measures the separation between correct and incorrect expert alignments. Positive margin means correct expert has higher alignment than all others (unique identification possible). Negative margin means another expert has higher alignment (misidentification).
- Computation: (1) For each router r_i, compute alignments with all experts, (2) Get correct alignment: align_correct = align(r_i, Expert_i), (3) Get maximum among other experts: align_max_other = max_{j≠i} align(r_i, Expert_j), (4) Compute margin = align_correct - align_max_other, (5) Average across all routers.
- Range: Can be negative or positive. Positive values indicate correct expert is best (unique identification). Negative values indicate another expert is better (confusion). Larger positive margins indicate stronger discriminative power.
- Relationship to argmax accuracy: A router is counted correct by argmax accuracy exactly when its own margin is positive; if every router has a positive margin, argmax accuracy = 1.0. The margin quantifies the strength of separation even when the argmax is already correct.
- Alignment Matrix: Full [n_experts × n_experts] matrix where entry (i, j) is the alignment of router vector i with expert j's principal subspace.
- Formula: Matrix[i, j] = align(r_i, Expert_j) = projection energy of router i onto expert j's top-k singular subspace.
- Interpretation: Shows the complete alignment landscape. Diagonal entries (i, i) are correct pairings. Off-diagonal entries show how well routers align with "wrong" experts. For unique identification, diagonal should be the maximum in each row.
- Computation: For each router-expert pair (i, j), compute alignment using the same projection energy formula as the main analysis.
- What to look for: Strong diagonal pattern (diagonal entries are highest in each row) indicates unique identification. Weak diagonal or strong off-diagonal entries indicate confusion between experts.
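The alignment matrix, argmax accuracy, and margins are tightly coupled; a sketch on toy data (routers set to each expert's top direction, so identification is perfect at k=1; names and sizes are illustrative):

```python
import numpy as np

def align(W, r, k):
    Vt = np.linalg.svd(W, full_matrices=False)[2]
    r = r / np.linalg.norm(r)
    return float(np.sum((Vt[:k] @ r) ** 2))

def identification_metrics(routers, experts, k):
    """Build the full [n, n] alignment matrix, then read argmax accuracy
    and per-router margins off its rows."""
    n = len(experts)
    M = np.array([[align(experts[j], routers[i], k) for j in range(n)]
                  for i in range(n)])                      # M[i, j]
    acc = float(np.mean(M.argmax(axis=1) == np.arange(n)))
    masked = M + np.where(np.eye(n, dtype=bool), -np.inf, 0.0)
    margins = np.diag(M) - masked.max(axis=1)              # correct minus next-best
    return M, acc, margins

rng = np.random.default_rng(4)
experts = [rng.standard_normal((16, 8)) for _ in range(4)]
routers = [np.linalg.svd(W, full_matrices=False)[2][0] for W in experts]
M, acc, margins = identification_metrics(routers, experts, k=1)
```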
- Margin Distribution: Histogram of alignment margins across all router vectors for different k values.
- Interpretation: Shows the distribution of discriminative power. Most routers with positive margins indicate good unique identification. Many routers with negative margins indicate frequent misidentification.
- What to look for: Distribution shifted to the right (positive values) indicates strong unique identification. Distribution centered near zero or shifted left indicates weak or no unique identification.
Interpretation Summary:
- Argmax accuracy near 1.0 + positive margins: Alignment is strong enough to uniquely identify experts from weights alone. Router vectors encode expert identity in a separable and discriminative manner.
- Argmax accuracy above random but < 1.0: Alignment is above chance but not strong enough for perfect unique identification. Some routers may be confused with other experts.
- Argmax accuracy near random baseline: Alignment is not discriminative enough to identify experts uniquely, even though it may be above the shuffled baseline (non-random but not uniquely identifying).
- Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement.
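Given an alignment matrix like the one described above (rows = routers, columns = experts), the argmax-accuracy and margin tests reduce to a few lines. This is a sketch with hypothetical names, not the report's code:

```python
import numpy as np

def argmax_and_margin(M):
    """Argmax accuracy and mean alignment margin from an
    [n_routers, n_experts] alignment matrix where the diagonal
    holds each router's correct expert."""
    correct = np.diag(M).copy()
    masked = M.astype(float).copy()
    np.fill_diagonal(masked, -np.inf)      # exclude the correct expert
    best_other = masked.max(axis=1)        # strongest competing expert per row
    accuracy = float(np.mean(correct > best_other))
    margin = float(np.mean(correct - best_other))
    return accuracy, margin

# Toy matrix: routers 0 and 1 are identified correctly, router 2 is not.
M = np.array([[0.9, 0.1, 0.2],
              [0.2, 0.8, 0.3],
              [0.4, 0.5, 0.3]])
acc, margin = argmax_and_margin(M)
```

A positive mean margin with accuracy 1.0 corresponds to the "strong unique identification" case; a negative mean margin means competing experts win on average for some routers.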
1. Comparison Across All Runs
This section compares all 6 result files side by side.
Comparison Plots
See "Plot Explanations" section at the top of this report for detailed information about this plot.
Unique Identification Comparison
See "Plot Explanations" section at the top of this report for detailed information about this plot.
Cos²(θ) Expert Comparison
See "Plot Explanations" section at the top of this report for detailed information about this plot.
Diagnostic Plots (Comparison)
Diagnostic plots comparing all runs to understand differences in alignment, z-score, and delta.
Shuffle Statistics
See "Plot Explanations" section at the top of this report for detailed information about this plot.
Z-score Decomposition
See "Plot Explanations" section at the top of this report for detailed information about this plot.
Distribution Comparison (k=32)
See "Plot Explanations" section at the top of this report for detailed information about this plot.
Distribution Comparison (k=128)
See "Plot Explanations" section at the top of this report for detailed information about this plot.
Distribution Comparison (k=512)
See "Plot Explanations" section at the top of this report for detailed information about this plot.
Distribution Comparison (k=2048)
See "Plot Explanations" section at the top of this report for detailed information about this plot.
Per-Expert Breakdown
See "Plot Explanations" section at the top of this report for detailed information about this plot.
Inter-Expert Orthogonality Analysis (Comparison)
This section compares expert-to-expert orthogonality across all runs. Lower off-diagonal similarity indicates better expert diversity.
Inter-Expert Orthogonality - L0
Statistics:
- Mean off-diagonal similarity: -0.0603
- Mean absolute off-diagonal similarity: 0.3233
- Max off-diagonal similarity: 0.8906
Inter-Expert Orthogonality: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.
- Method:
- Step 1 - Extract Principal Directions: For each expert in the layer, extract its top singular vector (v₁ = V[:, 0]) from the SVD decomposition. This vector represents the expert's "main direction" or principal component.
- Step 2 - Compute Similarity Matrix: Calculate the cosine similarity between every pair of experts' top singular vectors. The result is a [n_experts, n_experts] matrix where entry (i, j) is the cosine similarity between expert i and expert j's principal directions.
- Formula: similarity[i, j] = (v₁ᵢ · v₁ⱼ) / (||v₁ᵢ|| · ||v₁ⱼ||) = v₁ᵢ · v₁ⱼ (since vectors are normalized), where v₁ᵢ is the top singular vector of expert i.
- Visualization:
- Heatmap: The similarity matrix is displayed as a heatmap with expert indices on both axes.
- Colormap: Uses 'coolwarm' diverging colormap showing signed cosine similarity: blue (-1 = opposite directions) → white (0 = orthogonal) → red (+1 = same direction). The diagonal always shows 1.0 (experts are perfectly similar to themselves).
- Grid: Grid lines separate cells for better readability.
- Annotations: For small matrices (≤16 experts), similarity values are displayed as text annotations on each cell.
- Interpretation:
- Diagonal pattern (good): If the matrix shows high values (near 1.0) only on the diagonal and low values (near 0.0) off-diagonal, experts are orthogonal and diverse. This is the desired behavior - each expert specializes in a different direction.
- High off-diagonal values (bad): If off-diagonal entries are high (e.g., > 0.5), it indicates that multiple experts have similar principal directions. This suggests "Expert Collapse" - experts are redundant and not utilizing their full capacity.
- Block patterns (partial collapse): If there are blocks of high similarity (e.g., experts 0-3 are similar to each other, experts 4-7 are similar to each other), it indicates partial collapse where groups of experts are redundant.
- Mean off-diagonal similarity: A useful summary statistic. Values near 0 indicate good orthogonality. Values > 0.3 suggest significant redundancy. Values > 0.7 indicate severe expert collapse.
- Statistics Reported:
- Mean off-diagonal similarity: Average cosine similarity between different experts (excludes diagonal). Lower is better.
- Mean absolute off-diagonal similarity: Average absolute value of off-diagonal similarities (ignores sign). Lower is better.
- Max off-diagonal similarity: Maximum similarity between any two different experts. If this is high (> 0.7), at least two experts are very similar.
- Min off-diagonal similarity: Minimum similarity (can be negative if experts point in opposite directions).
- Std off-diagonal similarity: Standard deviation of off-diagonal similarities. Higher values indicate more variability in expert relationships.
- Why This Matters:
- Expert Diversity: MoE models are designed to have diverse experts that specialize in different aspects of the input. If experts collapse, the model is not utilizing its full capacity.
- Efficiency: Redundant experts waste model parameters and computation. Orthogonal experts maximize the model's representational capacity.
- Training Health: Expert collapse can indicate training issues (e.g., insufficient load balancing, poor routing, or optimization problems).
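The two method steps above can be sketched as follows. This is a minimal illustration (function name and toy data are assumptions); note that SVD sign ambiguity makes the sign of individual entries arbitrary, which may be one reason the mean absolute off-diagonal statistic is reported alongside the signed mean:

```python
import numpy as np

def expert_orthogonality(experts):
    """Pairwise cosine similarity between each expert's top right singular
    vector, plus the off-diagonal summary statistics reported above."""
    # Step 1: top right singular vector v1 per expert; SVD returns it unit-norm.
    v1 = np.stack([np.linalg.svd(W, full_matrices=False)[2][0] for W in experts])
    # Step 2: cosine similarity matrix (dot products of unit vectors).
    S = v1 @ v1.T                          # [n_experts, n_experts]
    off = S[~np.eye(len(experts), dtype=bool)]
    stats = {"mean": off.mean(), "mean_abs": np.abs(off).mean(),
             "max": off.max(), "min": off.min(), "std": off.std()}
    return S, stats

# Toy layer with 4 random experts (sizes illustrative).
rng = np.random.default_rng(1)
S, stats = expert_orthogonality([rng.standard_normal((16, 8)) for _ in range(4)])
```

The diagonal is always 1.0 (each expert compared with itself), and all entries lie in [-1, 1].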
Inter-Expert Orthogonality - L1
Statistics:
- Mean off-diagonal similarity: 0.3233
- Mean absolute off-diagonal similarity: 0.3306
- Max off-diagonal similarity: 0.8950
Inter-Expert Orthogonality - L9
Statistics:
- Mean off-diagonal similarity: -0.0458
- Mean absolute off-diagonal similarity: 0.4108
- Max off-diagonal similarity: 0.6760
Inter-Expert Orthogonality - L11
Statistics:
- Mean off-diagonal similarity: 0.1829
- Mean absolute off-diagonal similarity: 0.3728
- Max off-diagonal similarity: 0.6411
Inter-Expert Orthogonality - L21
Statistics:
- Mean off-diagonal similarity: 0.2060
- Mean absolute off-diagonal similarity: 0.3981
- Max off-diagonal similarity: 0.7257
Inter-Expert Orthogonality - L30
Statistics:
- Mean off-diagonal similarity: 0.1581
- Mean absolute off-diagonal similarity: 0.4320
- Max off-diagonal similarity: 0.6343
Unique Identification Analysis (Comparison)
This section compares unique identification metrics across all runs. These metrics test whether alignment is strong enough to uniquely identify experts, not just above chance.
⚠️ Could not generate unique identification comparison plots: 'numpy.ndarray' object has no attribute 'axis'
Per-Layer Ambiguity Score Analysis
Purpose: The Ambiguity Score combines multiple metrics (argmax accuracy, alignment margin, z-score) to identify layers with potential routing instability and load imbalance. Higher scores indicate more ambiguous layers.
Formula: AmbiguityScore = α·(1 - ArgmaxAccuracy) + β·(1 - NormalizedMargin) + γ·(1 - NormalizedZScore)
Interpretation: Layers ranked highest by ambiguity score are hypothesized to be at higher risk of routing instability and load imbalance.
Ambiguity Score Rankings (sorted by score, highest ambiguity first):
| Rank | Layer | Ambiguity Score | k* | Argmax Accuracy | Alignment Margin | Z-Score |
|------|-------|-----------------|------|-----------------|------------------|---------|
| 1 | 0 | 0.9500 | 4096 | 0.125 | 0.000000 | 0.00 |
| 2 | 1 | 0.3115 | 16 | 1.000 | 0.062931 | 2.04 |
| 3 | 11 | 0.2291 | 512 | 1.000 | 0.146158 | 2.15 |
| 4 | 9 | 0.0975 | 1024 | 1.000 | 0.261166 | 2.46 |
| 5 | 30 | 0.0725 | 128 | 1.000 | 0.278289 | 2.55 |
| 6 | 21 | 0.0000 | 64 | 1.000 | 0.357489 | 2.60 |
Ambiguity Score Visualization
Per-Layer Ambiguity Score Analysis: This analysis combines multiple metrics to identify layers with potential routing instability and load imbalance.
- Purpose: The Ambiguity Score quantifies how "ambiguous" a layer is in terms of router-expert alignment. Higher scores indicate layers where routing decisions may be less clear, potentially leading to instability or load imbalance.
- Method:
- Step 1 - Select k*: For each layer, select a fixed k* value. Options include:
- max_margin: Use the k value where alignment margin is maximal (default, recommended)
- fixed_128, fixed_256, fixed_512: Use a fixed k value across all layers
- Step 2 - Extract Metrics at k*: For each layer, extract:
- Argmax Accuracy: Fraction of routers where correct expert has maximum alignment
- Alignment Margin: Mean difference between correct expert's alignment and next-best expert's alignment
- Z-Score: Mean z-score versus shuffled router-expert pairings
- Step 3 - Normalize Across Layers: Normalize margins and z-scores to [0, 1] range across all layers for fair comparison.
- Step 4 - Compute Ambiguity Score: Weighted combination:
- Formula: AmbiguityScore = α·(1 - ArgmaxAccuracy) + β·(1 - NormalizedMargin) + γ·(1 - NormalizedZScore)
- Default weights: α = 0.4, β = 0.3, γ = 0.3 (sum to 1.0)
- Interpretation: Each component measures a different aspect of ambiguity:
- (1 - ArgmaxAccuracy): How often routers fail to identify the correct expert (higher = more ambiguous)
- (1 - NormalizedMargin): How small the separation is between correct and incorrect experts (higher = more ambiguous)
- (1 - NormalizedZScore): How weak the statistical significance is (higher = more ambiguous)
- Step 5 - Rank Layers: Sort layers by ambiguity score (descending). Highest scores indicate most ambiguous layers.
- Visualization:
- Plot 1 - Ambiguity Score by Layer: Bar chart showing ambiguity score for each layer. Higher bars indicate more ambiguous layers.
- Plot 2 - Components: Stacked or grouped bar chart showing the three components (1 - ArgmaxAccuracy, 1 - NormalizedMargin, 1 - NormalizedZScore) for each layer.
- Plot 3 - Argmax Accuracy by Layer: Line plot showing argmax accuracy across layers. Lower values indicate more ambiguity.
- Plot 4 - Alignment Margin by Layer: Line plot showing alignment margin across layers. Lower (or negative) values indicate more ambiguity.
- Interpretation:
- High Ambiguity Score (> 0.7): Layer has weak unique identification, small margins, and/or low z-scores. These layers are at high risk of routing instability and load imbalance.
- Medium Ambiguity Score (0.4-0.7): Layer has moderate ambiguity. Some routing decisions may be unclear, but not critically unstable.
- Low Ambiguity Score (< 0.4): Layer has strong unique identification, large margins, and high z-scores. These layers have clear routing decisions and are less likely to have stability issues.
- Ranking: Layers ranked highest (top of the list) are the most ambiguous and should be prioritized for investigation or intervention.
- Use Cases:
- Identify Problem Layers: Quickly identify which layers have the most ambiguous routing, helping prioritize debugging or optimization efforts.
- Compare Architectures: Compare ambiguity scores across different model architectures or training configurations.
- Monitor Training: Track ambiguity scores during training to detect when layers become more ambiguous (potential sign of training issues).
- Load Balancing: Layers with high ambiguity scores may benefit from load balancing interventions or routing regularization.
- Limitations:
- The score depends on the choice of k* and weights (α, β, γ). Different choices may yield different rankings.
- Normalization across layers assumes all layers should be compared on the same scale, which may not always be appropriate.
- High ambiguity doesn't necessarily mean the layer is "bad" - it may be intentionally ambiguous for certain tasks.
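The five-step procedure above can be sketched as follows, assuming min-max normalization of margins and z-scores across layers and the default weights α = 0.4, β = 0.3, γ = 0.3. Function names and the toy per-layer inputs are illustrative, not the report's code:

```python
import numpy as np

def ambiguity_scores(argmax_acc, margin, zscore, alpha=0.4, beta=0.3, gamma=0.3):
    """Per-layer ambiguity score at a fixed k*.
    Inputs are arrays indexed by layer, extracted at each layer's k*."""
    def minmax(x):
        # Step 3: normalize across layers to [0, 1] (assumed min-max scaling).
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    # Step 4: weighted combination of the three ambiguity components.
    return (alpha * (1 - np.asarray(argmax_acc))
            + beta * (1 - minmax(margin))
            + gamma * (1 - minmax(zscore)))

# Toy per-layer metrics for three layers (values illustrative).
acc    = [0.125, 1.0, 1.0]
margin = [0.0, 0.06, 0.36]
z      = [0.0, 2.0, 2.6]
scores = ambiguity_scores(acc, margin, z)
ranking = np.argsort(-scores)   # Step 5: most ambiguous layer first
```

A layer at the random-baseline accuracy with the worst margin and z-score gets the maximum possible score under these weights (0.4·0.875 + 0.3 + 0.3 = 0.95), while the best layer on all three components scores 0.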
2. Individual Analysis - Run 1
Setup and Configuration
Summary Statistics (averaged across experts)
| k | align | delta_vs_shuffle | z_vs_shuffle | effect_over_random | cos_squared | argmax_accuracy | alignment_margin |
|------|----------|------------------|--------------|--------------------|-------------|-----------------|------------------|
| 1 | 0.005770 | 0.002528 | 0.684055 | 0.005526 | 0.005770 | 0.500000 | -0.001939 |
| 2 | 0.007577 | 0.001943 | 0.400250 | 0.007089 | 0.000000 | 0.250000 | -0.005173 |
| 4 | 0.010832 | 0.001343 | 0.189122 | 0.009855 | 0.000000 | 0.125000 | -0.008038 |
| 8 | 0.016802 | -0.007680 | -0.556915 | 0.014849 | 0.000000 | 0.000000 | -0.026486 |
| 16 | 0.027283 | -0.024356 | -1.033393 | 0.023376 | 0.000000 | 0.000000 | -0.054445 |
| 32 | 0.051706 | -0.043072 | -1.131536 | 0.043893 | 0.000000 | 0.000000 | -0.086332 |
| 64 | 0.117375 | -0.040200 | -0.718772 | 0.101750 | 0.000000 | 0.125000 | -0.095110 |
| 128 | 0.218082 | -0.043899 | -0.648009 | 0.186832 | 0.000000 | 0.125000 | -0.111610 |
| 256 | 0.414622 | -0.028932 | -0.422043 | 0.352122 | 0.000000 | 0.250000 | -0.088238 |
| 512 | 0.637770 | -0.013315 | -0.271164 | 0.512770 | 0.000000 | 0.125000 | -0.057387 |
| 1024 | 0.821602 | 0.007298 | 0.219326 | 0.571602 | 0.000000 | 0.125000 | -0.031759 |
| 2048 | 0.929245 | 0.005412 | 0.370229 | 0.429245 | 0.000000 | 0.250000 | -0.014026 |
| 4096 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.125000 | 0.000000 |
Cos²(θ) Alignment (k=1)
Mean cos²(θ): 0.005770
Max cos²(θ): 0.011631
Min cos²(θ): 0.000310
Std cos²(θ): 0.004499
Per-expert cos²(θ) values:
| Expert | cos²(θ) | align |
|--------|----------|----------|
| 0 | 0.004563 | 0.004563 |
| 1 | 0.000310 | 0.000310 |
| 2 | 0.011379 | 0.011379 |
| 3 | 0.011631 | 0.011631 |
| 4 | 0.002967 | 0.002967 |
| 5 | 0.009454 | 0.009454 |
| 6 | 0.001051 | 0.001051 |
| 7 | 0.004804 | 0.004804 |
Detailed Results by K Value
K = 1:
- Mean align: 0.005770
- Mean z-score: 0.68
- Mean effect over random: 0.005526
- Argmax accuracy: 0.500 (random baseline: 0.125)
- Alignment margin: -0.001939
K = 2:
- Mean align: 0.007577
- Mean z-score: 0.40
- Mean effect over random: 0.007089
- Argmax accuracy: 0.250 (random baseline: 0.125)
- Alignment margin: -0.005173
K = 4:
- Mean align: 0.010832
- Mean z-score: 0.19
- Mean effect over random: 0.009855
- Argmax accuracy: 0.125 (random baseline: 0.125)
- Alignment margin: -0.008038
K = 8:
- Mean align: 0.016802
- Mean z-score: -0.56
- Mean effect over random: 0.014849
- Argmax accuracy: 0.000 (random baseline: 0.125)
- Alignment margin: -0.026486
K = 16:
- Mean align: 0.027283
- Mean z-score: -1.03
- Mean effect over random: 0.023376
- Argmax accuracy: 0.000 (random baseline: 0.125)
- Alignment margin: -0.054445
K = 32:
- Mean align: 0.051706
- Mean z-score: -1.13
- Mean effect over random: 0.043893
- Argmax accuracy: 0.000 (random baseline: 0.125)
- Alignment margin: -0.086332
K = 64:
- Mean align: 0.117375
- Mean z-score: -0.72
- Mean effect over random: 0.101750
- Argmax accuracy: 0.125 (random baseline: 0.125)
- Alignment margin: -0.095110
K = 128:
- Mean align: 0.218082
- Mean z-score: -0.65
- Mean effect over random: 0.186832
- Argmax accuracy: 0.125 (random baseline: 0.125)
- Alignment margin: -0.111610
K = 256:
- Mean align: 0.414622
- Mean z-score: -0.42
- Mean effect over random: 0.352122
- Argmax accuracy: 0.250 (random baseline: 0.125)
- Alignment margin: -0.088238
K = 512:
- Mean align: 0.637770
- Mean z-score: -0.27
- Mean effect over random: 0.512770
- Argmax accuracy: 0.125 (random baseline: 0.125)
- Alignment margin: -0.057387
K = 1024:
- Mean align: 0.821602
- Mean z-score: 0.22
- Mean effect over random: 0.571602
- Argmax accuracy: 0.125 (random baseline: 0.125)
- Alignment margin: -0.031759
K = 2048:
- Mean align: 0.929245
- Mean z-score: 0.37
- Mean effect over random: 0.429245
- Argmax accuracy: 0.250 (random baseline: 0.125)
- Alignment margin: -0.014026
K = 4096:
- Mean align: 1.000000
- Mean z-score: 0.00
- Mean effect over random: 0.000000
- Argmax accuracy: 0.125 (random baseline: 0.125)
- Alignment margin: 0.000000
Unique Identification Summary
Interpretation:
- Best unique identification at k=1: Argmax accuracy = 0.500, Margin = -0.001939
- Random baseline: 0.125 (1/8 experts)
- ⚠ Weak unique identification: Alignment is above the random baseline, but not discriminative enough to reliably single out the correct expert.
- Negative margin (-0.001939): On average, another expert attains higher alignment than the correct one, indicating misidentification.
Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement than just being above chance.
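The shuffled-baseline z-scores quoted above can be sketched with a small permutation test. Everything here is illustrative: the shapes and random data are placeholders, and a plain random permutation is used for the shuffle (the actual analysis may constrain shuffles differently, e.g. to derangements):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes only: 8 experts, weight matrices (128, 32), one router each.
n_experts, d_ff, d_model, k = 8, 128, 32, 4
experts = [rng.standard_normal((d_ff, d_model)) for _ in range(n_experts)]
routers = rng.standard_normal((n_experts, d_model))

# Cache each expert's top-k right singular vectors (rows of Vt).
tops = [np.linalg.svd(W, full_matrices=False)[2][:k] for W in experts]

def align_pair(i: int, j: int) -> float:
    """Projection energy of router i onto expert j's top-k subspace."""
    r = routers[i] / np.linalg.norm(routers[i])
    return float(np.sum((tops[j] @ r) ** 2))

# Actual mean alignment under the correct router-expert pairing.
actual = np.mean([align_pair(i, i) for i in range(n_experts)])

# Shuffled baseline: mean alignment under random router-expert permutations.
n_shuffles = 200
baseline = np.array([
    np.mean([align_pair(i, j) for i, j in enumerate(rng.permutation(n_experts))])
    for _ in range(n_shuffles)
])

# z(k) = (align(k) - shuffle_mean(k)) / shuffle_std(k)
z = (actual - baseline.mean()) / baseline.std()
```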
Unique Identification Analysis
Purpose: This analysis tests whether router-expert alignment is not just non-random, but actually uniquely identifying. While shuffled-baseline tests rule out random router-expert pairing, argmax and margin tests assess whether routing vectors encode expert identity in a separable and discriminative manner.
Key Questions:
- Does the correct expert achieve the maximum alignment? (Argmax accuracy)
- What is the separation between correct and next-best expert? (Alignment margin)
- Is alignment strong enough to uniquely identify experts from weights alone?
Unique Identification Analysis
Unique Identification Analysis: This analysis tests whether router-expert alignment is not just non-random, but actually uniquely identifying. It goes beyond shuffled-baseline tests to assess whether routing vectors encode expert identity in a separable and discriminative manner.
- Argmax Accuracy vs k: Fraction of router vectors where the correct expert achieves the maximum alignment among all experts.
- Formula: For each router vector r_i (assigned to expert i), compute alignment with ALL experts: align(r_i, Expert_j) for all j. Argmax accuracy = (1/n) · Σᵢ [argmax_j align(r_i, Expert_j) == i], where [·] is 1 if true, 0 otherwise.
- Interpretation: Measures whether the correct router-expert pairing can be uniquely identified from the alignment matrix. Value of 1.0 means perfect identification (every router's correct expert has the highest alignment). Value of 1/n_experts (e.g., 0.125 for 8 experts) means random guessing.
- Computation: (1) For each router vector r_i, compute alignment with all experts' principal subspaces, forming a row of the alignment matrix, (2) Find which expert has maximum alignment: argmax_j align(r_i, Expert_j), (3) Check if argmax equals the correct expert i, (4) Average across all routers to get accuracy.
- Range: [0, 1.0]. Value of 1.0 indicates perfect unique identification. Values near (or below) the chance level of 1/n_experts indicate alignment is not discriminative enough to identify experts uniquely.
- Comparison to shuffled baseline: While shuffled baseline tests whether alignment is above chance, argmax accuracy tests whether alignment is strong enough to uniquely identify the correct expert from all possible experts.
- Alignment Margin vs k: Mean difference between correct expert's alignment and the next-best expert's alignment.
- Formula: For each router vector r_i: margin_i = align(r_i, Expert_i) - max_{j≠i} align(r_i, Expert_j). Mean margin = (1/n) · Σᵢ margin_i.
- Interpretation: Measures the separation between correct and incorrect expert alignments. Positive margin means correct expert has higher alignment than all others (unique identification possible). Negative margin means another expert has higher alignment (misidentification).
- Computation: (1) For each router r_i, compute alignments with all experts, (2) Get correct alignment: align_correct = align(r_i, Expert_i), (3) Get maximum among other experts: align_max_other = max_{j≠i} align(r_i, Expert_j), (4) Compute margin = align_correct - align_max_other, (5) Average across all routers.
- Range: Can be negative or positive. Positive values indicate correct expert is best (unique identification). Negative values indicate another expert is better (confusion). Larger positive margins indicate stronger discriminative power.
- Relationship to argmax accuracy: When every router's margin is positive, argmax accuracy = 1.0; a negative mean margin implies some routers are misidentified, so argmax accuracy < 1.0. The margin quantifies the strength of separation even when the argmax is correct.
- Alignment Matrix: Full [n_experts × n_experts] matrix where entry (i, j) is the alignment of router vector i with expert j's principal subspace.
- Formula: Matrix[i, j] = align(r_i, Expert_j) = projection energy of router i onto expert j's top-k singular subspace.
- Interpretation: Shows the complete alignment landscape. Diagonal entries (i, i) are correct pairings. Off-diagonal entries show how well routers align with "wrong" experts. For unique identification, diagonal should be the maximum in each row.
- Computation: For each router-expert pair (i, j), compute alignment using the same projection energy formula as the main analysis.
- What to look for: Strong diagonal pattern (diagonal entries are highest in each row) indicates unique identification. Weak diagonal or strong off-diagonal entries indicate confusion between experts.
- Margin Distribution: Histogram of alignment margins across all router vectors for different k values.
- Interpretation: Shows the distribution of discriminative power. Most routers with positive margins indicate good unique identification. Many routers with negative margins indicate frequent misidentification.
- What to look for: Distribution shifted to the right (positive values) indicates strong unique identification. Distribution centered near zero or shifted left indicates weak or no unique identification.
Interpretation Summary:
- Argmax accuracy near 1.0 + positive margins: Alignment is strong enough to uniquely identify experts from weights alone. Router vectors encode expert identity in a separable and discriminative manner.
- Argmax accuracy above random but < 1.0: Alignment is above chance but not strong enough for perfect unique identification. Some routers may be confused with other experts.
- Argmax accuracy near random baseline: Alignment is not discriminative enough to identify experts uniquely, even though it may be above the shuffled baseline (non-random but not uniquely identifying).
- Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement.
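The argmax-accuracy and margin definitions above can be sketched as follows; shapes, names, and the random data are illustrative assumptions, not the report's actual weights:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes only: 8 experts, weight matrices (128, 32), one router each.
n_experts, d_ff, d_model, k = 8, 128, 32, 4
experts = [rng.standard_normal((d_ff, d_model)) for _ in range(n_experts)]
routers = rng.standard_normal((n_experts, d_model))

# Top-k right singular vectors per expert (rows of Vt).
tops = [np.linalg.svd(W, full_matrices=False)[2][:k] for W in experts]

# Alignment matrix: entry (i, j) = projection energy of router i
# onto expert j's top-k singular subspace.
A = np.empty((n_experts, n_experts))
for i, r in enumerate(routers):
    r_hat = r / np.linalg.norm(r)
    for j, Vt in enumerate(tops):
        A[i, j] = np.sum((Vt @ r_hat) ** 2)

# Argmax accuracy: fraction of routers whose own expert has maximum alignment.
argmax_acc = np.mean(A.argmax(axis=1) == np.arange(n_experts))

# Margin per router: correct alignment minus best competing alignment.
off = A.copy()
np.fill_diagonal(off, -np.inf)
margins = np.diag(A) - off.max(axis=1)
mean_margin = margins.mean()
```

A strong diagonal in `A` corresponds directly to `argmax_acc` near 1.0 and positive `margins`, matching the heatmap interpretation above.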
Inter-Expert Orthogonality Analysis
Purpose: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.
Method: For each expert, we extract its top singular vector (principal direction) from SVD. We then compute cosine similarity between all pairs of experts' principal directions.
Interpretation:
- Diagonal pattern (high self-similarity, low off-diagonal): Experts are orthogonal and diverse - good!
- High off-diagonal values: Experts are similar to each other - potential expert collapse or redundancy.
- Block patterns: Groups of experts are similar to each other - partial collapse.
Orthogonality Statistics:
- Mean off-diagonal similarity: -0.0603
- Mean absolute off-diagonal similarity: 0.3233
- Max off-diagonal similarity: 0.8906
- Min off-diagonal similarity: -0.8621
- Std off-diagonal similarity: 0.4614
Note: Off-diagonal values exclude self-similarity (diagonal). Lower values indicate better orthogonality.
Inter-Expert Orthogonality Heatmap (k=2)
Inter-Expert Orthogonality: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.
- Method:
- Step 1 - Extract Principal Directions: For each expert in the layer, extract its top singular vector (v₁ = V[:, 0]) from the SVD decomposition. This vector represents the expert's "main direction" or principal component.
- Step 2 - Compute Similarity Matrix: Calculate the cosine similarity between every pair of experts' top singular vectors. The result is a [n_experts, n_experts] matrix where entry (i, j) is the cosine similarity between expert i and expert j's principal directions.
- Formula: similarity[i, j] = (v₁ᵢ · v₁ⱼ) / (||v₁ᵢ|| · ||v₁ⱼ||) = v₁ᵢ · v₁ⱼ (since vectors are normalized), where v₁ᵢ is the top singular vector of expert i.
- Visualization:
- Heatmap: The similarity matrix is displayed as a heatmap with expert indices on both axes.
- Colormap: Uses 'coolwarm' diverging colormap showing signed cosine similarity: blue (-1 = opposite directions) → white (0 = orthogonal) → red (+1 = same direction). The diagonal always shows 1.0 (experts are perfectly similar to themselves).
- Grid: Grid lines separate cells for better readability.
- Annotations: For small matrices (≤16 experts), similarity values are displayed as text annotations on each cell.
- Interpretation:
- Diagonal pattern (good): If the matrix shows high values (near 1.0) only on the diagonal and low values (near 0.0) off-diagonal, experts are orthogonal and diverse. This is the desired behavior - each expert specializes in a different direction.
- High off-diagonal values (bad): If off-diagonal entries are high (e.g., > 0.5), it indicates that multiple experts have similar principal directions. This suggests "Expert Collapse" - experts are redundant and not utilizing their full capacity.
- Block patterns (partial collapse): If there are blocks of high similarity (e.g., experts 0-3 are similar to each other, experts 4-7 are similar to each other), it indicates partial collapse where groups of experts are redundant.
- Mean off-diagonal similarity: A useful summary statistic. Values near 0 indicate good orthogonality. Values > 0.3 suggest significant redundancy. Values > 0.7 indicate severe expert collapse.
- Statistics Reported:
- Mean off-diagonal similarity: Average cosine similarity between different experts (excludes diagonal). Lower is better.
- Mean absolute off-diagonal similarity: Average absolute value of off-diagonal similarities (ignores sign). Lower is better.
- Max off-diagonal similarity: Maximum similarity between any two different experts. If this is high (> 0.7), at least two experts are very similar.
- Min off-diagonal similarity: Minimum similarity (can be negative if experts point in opposite directions).
- Std off-diagonal similarity: Standard deviation of off-diagonal similarities. Higher values indicate more variability in expert relationships.
- Why This Matters:
- Expert Diversity: MoE models are designed to have diverse experts that specialize in different aspects of the input. If experts collapse, the model is not utilizing its full capacity.
- Efficiency: Redundant experts waste model parameters and computation. Orthogonal experts maximize the model's representational capacity.
- Training Health: Expert collapse can indicate training issues (e.g., insufficient load balancing, poor routing, or optimization problems).
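The two-step method above can be sketched in a few lines; the shapes and random weights here are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes only: 8 experts, weight matrices (128, 32).
n_experts, d_ff, d_model = 8, 128, 32
experts = [rng.standard_normal((d_ff, d_model)) for _ in range(n_experts)]

# Step 1: top right singular vector (principal direction) per expert.
# SVD returns unit-norm singular vectors, so no extra normalization is needed.
dirs = np.stack([np.linalg.svd(W, full_matrices=False)[2][0] for W in experts])

# Step 2: cosine similarity between every pair of principal directions.
S = dirs @ dirs.T  # [n_experts, n_experts]; diagonal is 1.0

# Off-diagonal summary statistics, as reported above.
mask = ~np.eye(n_experts, dtype=bool)
off = S[mask]
stats = {
    "mean": off.mean(),
    "mean_abs": np.abs(off).mean(),
    "max": off.max(),
    "min": off.min(),
    "std": off.std(),
}
```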
Inter-Expert Orthogonality Comparison Across k Values
Purpose: This comparison shows how expert orthogonality changes when considering different numbers of principal directions (k). Higher k values consider more singular vectors, potentially revealing more subtle similarities between experts.
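The report does not spell out its exact similarity measure for k > 1, but one common choice (an assumption here) is the mean squared cosine of the principal angles between two top-k subspaces, computed as ||V_i⁽ᵏ⁾ V_j⁽ᵏ⁾ᵀ||²_F / k. Note this overlap is unsigned, unlike the signed cosine heatmap at k = 1:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes only: 8 experts, weight matrices (128, 32).
n_experts, d_ff, d_model = 8, 128, 32
experts = [rng.standard_normal((d_ff, d_model)) for _ in range(n_experts)]

def subspace_overlap(Vk_a: np.ndarray, Vk_b: np.ndarray) -> float:
    """Mean squared cosine of principal angles between two k-dimensional
    subspaces, given orthonormal bases as rows. 1 = identical, 0 = orthogonal."""
    k = Vk_a.shape[0]
    return float(np.sum((Vk_a @ Vk_b.T) ** 2) / k)

# Mean pairwise overlap across experts for increasing k.
for k in (1, 2, 4):
    bases = [np.linalg.svd(W, full_matrices=False)[2][:k] for W in experts]
    overlaps = [subspace_overlap(bases[i], bases[j])
                for i in range(n_experts) for j in range(i + 1, n_experts)]
    print(k, np.mean(overlaps))
```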
Complete Analysis Plots
Comprehensive visualization of all metrics for this run.
Complete Analysis Visualization
See "Plot Explanations" section at the top of this report for detailed information about this plot.
3. Individual Analysis - Run 2
Setup and Configuration
Summary Statistics (averaged across experts)
| k | align | delta_vs_shuffle | z_vs_shuffle | effect_over_random | cos_squared | argmax_accuracy | alignment_margin |
|---|---|---|---|---|---|---|---|
| 1 | 0.057551 | 0.047232 | 2.554473 | 0.057307 | 0.057551 | 1.000000 | 0.049831 |
| 2 | 0.059381 | 0.047484 | 2.482403 | 0.058893 | 0.000000 | 1.000000 | 0.050548 |
| 4 | 0.068402 | 0.054134 | 2.553551 | 0.067426 | 0.000000 | 1.000000 | 0.056529 |
| 8 | 0.075794 | 0.054138 | 2.426632 | 0.073841 | 0.000000 | 1.000000 | 0.054771 |
| 16 | 0.104669 | 0.068705 | 2.044184 | 0.100762 | 0.000000 | 1.000000 | 0.062931 |
| 32 | 0.139227 | 0.069573 | 1.228228 | 0.131414 | 0.000000 | 0.750000 | 0.056749 |
| 64 | 0.188318 | 0.063255 | 0.907765 | 0.172693 | 0.000000 | 0.250000 | 0.038525 |
| 128 | 0.264706 | 0.052687 | 0.590516 | 0.233456 | 0.000000 | 0.125000 | 0.011865 |
| 256 | 0.416913 | 0.053419 | 0.453827 | 0.354413 | 0.000000 | 0.125000 | -0.001926 |
| 512 | 0.611965 | 0.060646 | 0.445669 | 0.486965 | 0.000000 | 0.125000 | 0.000992 |
| 1024 | 0.767452 | 0.057432 | 0.421641 | 0.517452 | 0.000000 | 0.250000 | -0.000392 |
| 2048 | 0.885113 | 0.037467 | 0.398855 | 0.385113 | 0.000000 | 0.250000 | 0.004784 |
| 4096 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.125000 | 0.000000 |
Cos²(θ) Alignment (k=1)
Mean cos²(θ): 0.057551
Max cos²(θ): 0.070422
Min cos²(θ): 0.042068
Std cos²(θ): 0.011621
Per-expert cos²(θ) values:
| Expert | cos²(θ) | align |
|---|---|---|
| 0 | 0.054994 | 0.054994 |
| 1 | 0.070422 | 0.070422 |
| 2 | 0.059095 | 0.059095 |
| 3 | 0.042068 | 0.042068 |
| 4 | 0.068530 | 0.068530 |
| 5 | 0.053046 | 0.053046 |
| 6 | 0.070022 | 0.070022 |
| 7 | 0.042234 | 0.042234 |
Detailed Results by K Value
K = 1:
- Mean align: 0.057551
- Mean z-score: 2.55
- Mean effect over random: 0.057307
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.049831
K = 2:
- Mean align: 0.059381
- Mean z-score: 2.48
- Mean effect over random: 0.058893
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.050548
K = 4:
- Mean align: 0.068402
- Mean z-score: 2.55
- Mean effect over random: 0.067426
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.056529
K = 8:
- Mean align: 0.075794
- Mean z-score: 2.43
- Mean effect over random: 0.073841
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.054771
K = 16:
- Mean align: 0.104669
- Mean z-score: 2.04
- Mean effect over random: 0.100762
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.062931
K = 32:
- Mean align: 0.139227
- Mean z-score: 1.23
- Mean effect over random: 0.131414
- Argmax accuracy: 0.750 (random baseline: 0.125)
- Alignment margin: 0.056749
K = 64:
- Mean align: 0.188318
- Mean z-score: 0.91
- Mean effect over random: 0.172693
- Argmax accuracy: 0.250 (random baseline: 0.125)
- Alignment margin: 0.038525
K = 128:
- Mean align: 0.264706
- Mean z-score: 0.59
- Mean effect over random: 0.233456
- Argmax accuracy: 0.125 (random baseline: 0.125)
- Alignment margin: 0.011865
K = 256:
- Mean align: 0.416913
- Mean z-score: 0.45
- Mean effect over random: 0.354413
- Argmax accuracy: 0.125 (random baseline: 0.125)
- Alignment margin: -0.001926
K = 512:
- Mean align: 0.611965
- Mean z-score: 0.45
- Mean effect over random: 0.486965
- Argmax accuracy: 0.125 (random baseline: 0.125)
- Alignment margin: 0.000992
K = 1024:
- Mean align: 0.767452
- Mean z-score: 0.42
- Mean effect over random: 0.517452
- Argmax accuracy: 0.250 (random baseline: 0.125)
- Alignment margin: -0.000392
K = 2048:
- Mean align: 0.885113
- Mean z-score: 0.40
- Mean effect over random: 0.385113
- Argmax accuracy: 0.250 (random baseline: 0.125)
- Alignment margin: 0.004784
K = 4096:
- Mean align: 1.000000
- Mean z-score: 0.00
- Mean effect over random: 0.000000
- Argmax accuracy: 0.125 (random baseline: 0.125)
- Alignment margin: 0.000000
Unique Identification Summary
Interpretation:
- Best unique identification at k=1: Argmax accuracy = 1.000, Margin = 0.049831
- Random baseline: 0.125 (1/8 experts)
- ✓ Excellent unique identification: Alignment is strong enough to uniquely identify experts from weights alone. Router vectors encode expert identity in a highly separable and discriminative manner.
- Positive margin (0.049831): Correct expert has higher alignment than all other experts, enabling unique identification.
Unique Identification Analysis
This analysis is explained in detail under Run 1 above; the same plot definitions (argmax accuracy vs k, alignment margin vs k, alignment matrix, margin distribution) apply to this run.
Inter-Expert Orthogonality Analysis
Purpose, method, and interpretation are identical to the Run 1 section above; the statistics below are for this run.
Orthogonality Statistics:
- Mean off-diagonal similarity: 0.3233
- Mean absolute off-diagonal similarity: 0.3306
- Max off-diagonal similarity: 0.8950
- Min off-diagonal similarity: -0.0241
- Std off-diagonal similarity: 0.3709
Note: Off-diagonal values exclude self-similarity (diagonal). Lower values indicate better orthogonality.
Inter-Expert Orthogonality Heatmap (k=2)
See the identical plot explanation under Run 1 above.
Inter-Expert Orthogonality Comparison Across k Values
Purpose: This comparison shows how expert orthogonality changes when considering different numbers of principal directions (k). Higher k values consider more singular vectors, potentially revealing more subtle similarities between experts.
Complete Analysis Plots
Comprehensive visualization of all metrics for this run.
Complete Analysis Visualization
See "Plot Explanations" section at the top of this report for detailed information about this plot.
4. Individual Analysis - Run 3
Setup and Configuration
Summary Statistics (averaged across experts)
| k | align | delta_vs_shuffle | z_vs_shuffle | effect_over_random | cos_squared | argmax_accuracy | alignment_margin |
|---|---|---|---|---|---|---|---|
| 1 | 0.064961 | 0.051988 | 2.218826 | 0.064717 | 0.064961 | 0.875000 | 0.051164 |
| 2 | 0.074739 | 0.058293 | 2.284018 | 0.074251 | 0.000000 | 1.000000 | 0.058322 |
| 4 | 0.082144 | 0.062073 | 2.388455 | 0.081167 | 0.000000 | 1.000000 | 0.062144 |
| 8 | 0.091984 | 0.066795 | 2.488270 | 0.090031 | 0.000000 | 1.000000 | 0.066244 |
| 16 | 0.111207 | 0.074492 | 2.354253 | 0.107300 | 0.000000 | 1.000000 | 0.073237 |
| 32 | 0.136211 | 0.085793 | 2.295637 | 0.128399 | 0.000000 | 1.000000 | 0.083369 |
| 64 | 0.172688 | 0.101731 | 2.336120 | 0.157063 | 0.000000 | 1.000000 | 0.101138 |
| 128 | 0.219534 | 0.118886 | 2.213562 | 0.188284 | 0.000000 | 1.000000 | 0.116449 |
| 256 | 0.284807 | 0.144395 | 2.246814 | 0.222307 | 0.000000 | 1.000000 | 0.137058 |
| 512 | 0.372220 | 0.169235 | 2.153636 | 0.247220 | 0.000000 | 1.000000 | 0.146158 |
| 1024 | 0.493685 | 0.183328 | 1.938368 | 0.243685 | 0.000000 | 1.000000 | 0.129647 |
| 2048 | 0.676953 | 0.169473 | 1.859634 | 0.176953 | 0.000000 | 1.000000 | 0.093153 |
| 4096 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.125000 | 0.000000 |
Cos²(θ) Alignment (k=1)
Mean cos²(θ): 0.064961
Max cos²(θ): 0.097719
Min cos²(θ): 0.002639
Std cos²(θ): 0.029249
Per-expert cos²(θ) values:
| Expert | cos²(θ) | align |
|---|---|---|
| 0 | 0.073076 | 0.073076 |
| 1 | 0.090944 | 0.090944 |
| 2 | 0.068927 | 0.068927 |
| 3 | 0.002639 | 0.002639 |
| 4 | 0.097719 | 0.097719 |
| 5 | 0.062334 | 0.062334 |
| 6 | 0.073216 | 0.073216 |
| 7 | 0.050834 | 0.050834 |
Detailed Results by K Value
K = 1:
- Mean align: 0.064961
- Mean z-score: 2.22
- Mean effect over random: 0.064717
- Argmax accuracy: 0.875 (random baseline: 0.125)
- Alignment margin: 0.051164
K = 2:
- Mean align: 0.074739
- Mean z-score: 2.28
- Mean effect over random: 0.074251
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.058322
K = 4:
- Mean align: 0.082144
- Mean z-score: 2.39
- Mean effect over random: 0.081167
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.062144
K = 8:
- Mean align: 0.091984
- Mean z-score: 2.49
- Mean effect over random: 0.090031
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.066244
K = 16:
- Mean align: 0.111207
- Mean z-score: 2.35
- Mean effect over random: 0.107300
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.073237
K = 32:
- Mean align: 0.136211
- Mean z-score: 2.30
- Mean effect over random: 0.128399
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.083369
K = 64:
- Mean align: 0.172688
- Mean z-score: 2.34
- Mean effect over random: 0.157063
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.101138
K = 128:
- Mean align: 0.219534
- Mean z-score: 2.21
- Mean effect over random: 0.188284
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.116449
K = 256:
- Mean align: 0.284807
- Mean z-score: 2.25
- Mean effect over random: 0.222307
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.137058
K = 512:
- Mean align: 0.372220
- Mean z-score: 2.15
- Mean effect over random: 0.247220
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.146158
K = 1024:
- Mean align: 0.493685
- Mean z-score: 1.94
- Mean effect over random: 0.243685
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.129647
K = 2048:
- Mean align: 0.676953
- Mean z-score: 1.86
- Mean effect over random: 0.176953
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.093153
K = 4096:
- Mean align: 1.000000
- Mean z-score: 0.00
- Mean effect over random: 0.000000
- Argmax accuracy: 0.125 (random baseline: 0.125)
- Alignment margin: 0.000000
Unique Identification Summary
Interpretation:
- Best unique identification at k=2: Argmax accuracy = 1.000, Margin = 0.058322
- Random baseline: 0.125 (1/8 experts)
- ✓ Excellent unique identification: Alignment is strong enough to uniquely identify experts from weights alone. Router vectors encode expert identity in a highly separable and discriminative manner.
- Positive margin (0.058322): Correct expert has higher alignment than all other experts, enabling unique identification.
Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement than just being above chance.
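The shuffle test referred to above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the report's actual implementation: it assumes the baseline is built by randomly permuting the router-to-expert assignment, and the function name `shuffle_zscore` and the precomputed alignment matrix `A` are hypothetical.

```python
import numpy as np

def shuffle_zscore(A, n_shuffles=1000, seed=0):
    """Z-score of the mean correct-pair alignment against a baseline in
    which the router-to-expert assignment is randomly permuted.

    A[i, j] = align(router i, expert j); router i belongs to expert i.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    actual = np.diag(A).mean()                      # mean over correct pairs
    baseline = np.array([
        A[np.arange(n), rng.permutation(n)].mean()  # random re-pairing
        for _ in range(n_shuffles)
    ])
    return (actual - baseline.mean()) / baseline.std()
```

A z-score well above 2, as in the tables below, means the correct pairings align far better than almost any random re-pairing of the same routers and experts.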
Unique Identification Analysis
Purpose: This analysis tests whether router-expert alignment is not just non-random, but actually uniquely identifying. While shuffled-baseline tests rule out random router-expert pairing, argmax and margin tests assess whether routing vectors encode expert identity in a separable and discriminative manner.
Key Questions:
- Does the correct expert achieve the maximum alignment? (Argmax accuracy)
- What is the separation between correct and next-best expert? (Alignment margin)
- Is alignment strong enough to uniquely identify experts from weights alone?
Unique Identification Analysis
Unique Identification Analysis: This analysis tests whether router-expert alignment is not just non-random, but actually uniquely identifying. It goes beyond shuffled-baseline tests to assess whether routing vectors encode expert identity in a separable and discriminative manner.
- Argmax Accuracy vs k: Fraction of router vectors where the correct expert achieves the maximum alignment among all experts.
- Formula: For each router vector r_i (assigned to expert i), compute alignment with ALL experts: align(r_i, Expert_j) for all j. Argmax accuracy = (1/n) · Σᵢ [argmax_j align(r_i, Expert_j) == i], where [·] is 1 if true, 0 otherwise.
- Interpretation: Measures whether the correct router-expert pairing can be uniquely identified from the alignment matrix. Value of 1.0 means perfect identification (every router's correct expert has the highest alignment). Value of 1/n_experts (e.g., 0.125 for 8 experts) means random guessing.
- Computation: (1) For each router vector r_i, compute alignment with all experts' principal subspaces, forming a row of the alignment matrix, (2) Find which expert has maximum alignment: argmax_j align(r_i, Expert_j), (3) Check if argmax equals the correct expert i, (4) Average across all routers to get accuracy.
- Range: [1/n_experts, 1.0]. Value of 1.0 indicates perfect unique identification. Value near 1/n_experts indicates alignment is not discriminative enough to identify experts uniquely.
- Comparison to shuffled baseline: While shuffled baseline tests whether alignment is above chance, argmax accuracy tests whether alignment is strong enough to uniquely identify the correct expert from all possible experts.
- Alignment Margin vs k: Mean difference between correct expert's alignment and the next-best expert's alignment.
- Formula: For each router vector r_i: margin_i = align(r_i, Expert_i) - max_{j≠i} align(r_i, Expert_j). Mean margin = (1/n) · Σᵢ margin_i.
- Interpretation: Measures the separation between correct and incorrect expert alignments. Positive margin means correct expert has higher alignment than all others (unique identification possible). Negative margin means another expert has higher alignment (misidentification).
- Computation: (1) For each router r_i, compute alignments with all experts, (2) Get correct alignment: align_correct = align(r_i, Expert_i), (3) Get maximum among other experts: align_max_other = max_{j≠i} align(r_i, Expert_j), (4) Compute margin = align_correct - align_max_other, (5) Average across all routers.
- Range: Can be negative or positive. Positive values indicate correct expert is best (unique identification). Negative values indicate another expert is better (confusion). Larger positive margins indicate stronger discriminative power.
- Relationship to argmax accuracy: For an individual router, margin_i > 0 exactly when that router's correct expert ranks first, so argmax accuracy equals the fraction of routers with positive margin (a positive mean margin alone does not guarantee accuracy of 1.0). Margin additionally quantifies the strength of separation even when the argmax is correct.
- Alignment Matrix: Full [n_experts × n_experts] matrix where entry (i, j) is the alignment of router vector i with expert j's principal subspace.
- Formula: Matrix[i, j] = align(r_i, Expert_j) = projection energy of router i onto expert j's top-k singular subspace.
- Interpretation: Shows the complete alignment landscape. Diagonal entries (i, i) are correct pairings. Off-diagonal entries show how well routers align with "wrong" experts. For unique identification, diagonal should be the maximum in each row.
- Computation: For each router-expert pair (i, j), compute alignment using the same projection energy formula as the main analysis.
- What to look for: Strong diagonal pattern (diagonal entries are highest in each row) indicates unique identification. Weak diagonal or strong off-diagonal entries indicate confusion between experts.
- Margin Distribution: Histogram of alignment margins across all router vectors for different k values.
- Interpretation: Shows the distribution of discriminative power. Most routers with positive margins indicate good unique identification. Many routers with negative margins indicate frequent misidentification.
- What to look for: Distribution shifted to the right (positive values) indicates strong unique identification. Distribution centered near zero or shifted left indicates weak or no unique identification.
Interpretation Summary:
- Argmax accuracy near 1.0 + positive margins: Alignment is strong enough to uniquely identify experts from weights alone. Router vectors encode expert identity in a separable and discriminative manner.
- Argmax accuracy above random but < 1.0: Alignment is above chance but not strong enough for perfect unique identification. Some routers may be confused with other experts.
- Argmax accuracy near random baseline: Alignment is not discriminative enough to identify experts uniquely, even though it may be above the shuffled baseline (non-random but not uniquely identifying).
- Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement.
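The argmax-accuracy and margin computations described above can be sketched as follows. This is a minimal NumPy illustration, not the report's implementation; it assumes one router vector per expert, and the function names (`alignment_matrix`, `argmax_accuracy_and_margin`) are hypothetical.

```python
import numpy as np

def alignment_matrix(routers, expert_weights, k):
    """A[i, j] = projection energy of unit-normalized router i onto the
    top-k right-singular subspace of expert j's weight matrix."""
    A = np.zeros((len(routers), len(expert_weights)))
    for j, W in enumerate(expert_weights):
        Vk = np.linalg.svd(W, full_matrices=False)[2][:k]  # (k, d) rows of Vt
        for i, r in enumerate(routers):
            r = r / np.linalg.norm(r)
            A[i, j] = np.sum((Vk @ r) ** 2)
    return A

def argmax_accuracy_and_margin(A):
    """Fraction of routers whose own expert has the highest alignment,
    and the mean gap to the best competing expert."""
    n = A.shape[0]
    correct = np.diag(A).copy()
    others = A.copy()
    np.fill_diagonal(others, -np.inf)               # exclude the correct expert
    best_other = others.max(axis=1)
    accuracy = np.mean(A.argmax(axis=1) == np.arange(n))
    margin = np.mean(correct - best_other)
    return accuracy, margin
```

Note that at full rank (k equal to the hidden dimension) every entry of the alignment matrix is 1.0, so the argmax is degenerate and accuracy falls back to the 1/n_experts baseline, consistent with the k = 4096 rows in this report.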
Inter-Expert Orthogonality Analysis
Purpose: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.
Method: For each expert, we extract its top singular vector (principal direction) from SVD. We then compute cosine similarity between all pairs of experts' principal directions.
Interpretation:
- Diagonal pattern (high self-similarity, low off-diagonal): Experts are orthogonal and diverse - good!
- High off-diagonal values: Experts are similar to each other - potential expert collapse or redundancy.
- Block patterns: Groups of experts are similar to each other - partial collapse.
Orthogonality Statistics:
- Mean off-diagonal similarity: 0.1829
- Mean absolute off-diagonal similarity: 0.3728
- Max off-diagonal similarity: 0.6411
- Min off-diagonal similarity: -0.7478
- Std off-diagonal similarity: 0.3644
Note: Off-diagonal values exclude self-similarity (diagonal). Lower values indicate better orthogonality.
Inter-Expert Orthogonality Heatmap (k=2)
Inter-Expert Orthogonality: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.
- Method:
- Step 1 - Extract Principal Directions: For each expert in the layer, extract its top singular vector (v₁ = V[:, 0]) from the SVD decomposition. This vector represents the expert's "main direction" or principal component.
- Step 2 - Compute Similarity Matrix: Calculate the cosine similarity between every pair of experts' top singular vectors. The result is a [n_experts, n_experts] matrix where entry (i, j) is the cosine similarity between expert i and expert j's principal directions.
- Formula: similarity[i, j] = (v₁ᵢ · v₁ⱼ) / (||v₁ᵢ|| · ||v₁ⱼ||) = v₁ᵢ · v₁ⱼ (since vectors are normalized), where v₁ᵢ is the top singular vector of expert i.
- Visualization:
- Heatmap: The similarity matrix is displayed as a heatmap with expert indices on both axes.
- Colormap: Uses 'coolwarm' diverging colormap showing signed cosine similarity: blue (-1 = opposite directions) → white (0 = orthogonal) → red (+1 = same direction). The diagonal always shows 1.0 (experts are perfectly similar to themselves).
- Grid: Grid lines separate cells for better readability.
- Annotations: For small matrices (≤16 experts), similarity values are displayed as text annotations on each cell.
- Interpretation:
- Diagonal pattern (good): If the matrix shows high values (near 1.0) only on the diagonal and low values (near 0.0) off-diagonal, experts are orthogonal and diverse. This is the desired behavior - each expert specializes in a different direction.
- High off-diagonal values (bad): If off-diagonal entries are high (e.g., > 0.5), it indicates that multiple experts have similar principal directions. This suggests "Expert Collapse" - experts are redundant and not utilizing their full capacity.
- Block patterns (partial collapse): If there are blocks of high similarity (e.g., experts 0-3 are similar to each other, experts 4-7 are similar to each other), it indicates partial collapse where groups of experts are redundant.
- Mean off-diagonal similarity: A useful summary statistic. Values near 0 indicate good orthogonality. Values > 0.3 suggest significant redundancy. Values > 0.7 indicate severe expert collapse.
- Statistics Reported:
- Mean off-diagonal similarity: Average cosine similarity between different experts (excludes diagonal). Lower is better.
- Mean absolute off-diagonal similarity: Average absolute value of off-diagonal similarities (ignores sign). Lower is better.
- Max off-diagonal similarity: Maximum similarity between any two different experts. If this is high (> 0.7), at least two experts are very similar.
- Min off-diagonal similarity: Minimum similarity (can be negative if experts point in opposite directions).
- Std off-diagonal similarity: Standard deviation of off-diagonal similarities. Higher values indicate more variability in expert relationships.
- Why This Matters:
- Expert Diversity: MoE models are designed to have diverse experts that specialize in different aspects of the input. If experts collapse, the model is not utilizing its full capacity.
- Efficiency: Redundant experts waste model parameters and computation. Orthogonal experts maximize the model's representational capacity.
- Training Health: Expert collapse can indicate training issues (e.g., insufficient load balancing, poor routing, or optimization problems).
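The orthogonality computation described above can be sketched as follows, assuming experts are given as weight matrices; the function name `expert_orthogonality` is illustrative. SVD determines each singular vector only up to sign, which is one reason the report also tracks the mean absolute off-diagonal similarity.

```python
import numpy as np

def expert_orthogonality(expert_weights):
    """Cosine-similarity matrix between the experts' top right-singular
    vectors, plus the off-diagonal summary statistics."""
    # Top right singular vector of each expert (first row of Vt; unit length).
    dirs = np.stack([np.linalg.svd(W, full_matrices=False)[2][0]
                     for W in expert_weights])
    S = dirs @ dirs.T                               # signed cosine similarity
    off = S[~np.eye(len(dirs), dtype=bool)]         # drop the diagonal
    stats = {
        "mean": off.mean(),
        "mean_abs": np.abs(off).mean(),
        "max": off.max(),
        "min": off.min(),
        "std": off.std(),
    }
    return S, stats
```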
Inter-Expert Orthogonality Comparison Across k Values
Purpose: This comparison shows how expert orthogonality changes when considering different numbers of principal directions (k). Higher k values consider more singular vectors, potentially revealing more subtle similarities between experts.
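The report does not spell out how similarity is extended beyond a single direction. One standard choice consistent with this comparison is the normalized overlap between the two experts' top-k right-singular subspaces (the mean squared cosine of the principal angles between them); the helper `subspace_overlap` below is an illustrative sketch under that assumption, not the report's code.

```python
import numpy as np

def subspace_overlap(W_a, W_b, k):
    """Normalized overlap between the top-k right-singular subspaces of
    two expert weight matrices: 1.0 when the subspaces coincide, 0.0
    when they are mutually orthogonal."""
    Va = np.linalg.svd(W_a, full_matrices=False)[2][:k]  # (k, d)
    Vb = np.linalg.svd(W_b, full_matrices=False)[2][:k]
    # ||Va @ Vb.T||_F^2 is the sum of squared cosines of the principal
    # angles between the subspaces; dividing by k maps it to [0, 1].
    return np.sum((Va @ Vb.T) ** 2) / k
```

At k=1 this reduces to the squared cosine similarity of the top singular vectors, so it generalizes the single-direction heatmap to larger k.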
Complete Analysis Plots
Comprehensive visualization of all metrics for this run.
Complete Analysis Visualization
See "Plot Explanations" section at the top of this report for detailed information about this plot.
5. Individual Analysis - Run 4
Setup and Configuration
Summary Statistics (averaged across experts)
| k | align | delta_vs_shuffle | z_vs_shuffle | effect_over_random | cos_squared | argmax_accuracy | alignment_margin |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.086661 | 0.072986 | 2.512674 | 0.086417 | 0.086661 | 1.000000 | 0.080441 |
| 2 | 0.088815 | 0.075093 | 2.719778 | 0.088327 | 0.000000 | 1.000000 | 0.080075 |
| 4 | 0.102974 | 0.083705 | 2.541455 | 0.101998 | 0.000000 | 1.000000 | 0.089314 |
| 8 | 0.118629 | 0.091889 | 2.453338 | 0.116676 | 0.000000 | 1.000000 | 0.098466 |
| 16 | 0.142135 | 0.105862 | 2.694624 | 0.138229 | 0.000000 | 1.000000 | 0.110616 |
| 32 | 0.174238 | 0.119605 | 2.677794 | 0.166426 | 0.000000 | 1.000000 | 0.123853 |
| 64 | 0.217419 | 0.131549 | 2.434596 | 0.201794 | 0.000000 | 1.000000 | 0.139225 |
| 128 | 0.285893 | 0.167746 | 2.627440 | 0.254643 | 0.000000 | 1.000000 | 0.174368 |
| 256 | 0.375532 | 0.204085 | 2.376817 | 0.313032 | 0.000000 | 1.000000 | 0.215717 |
| 512 | 0.476727 | 0.248076 | 2.649691 | 0.351727 | 0.000000 | 1.000000 | 0.250286 |
| 1024 | 0.602968 | 0.262075 | 2.457562 | 0.352968 | 0.000000 | 1.000000 | 0.261166 |
| 2048 | 0.763862 | 0.215525 | 2.429688 | 0.263862 | 0.000000 | 1.000000 | 0.206376 |
| 4096 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.125000 | 0.000000 |
Cos²(θ) Alignment (k=1)
Mean cos²(θ): 0.086661
Max cos²(θ): 0.102524
Min cos²(θ): 0.059184
Std cos²(θ): 0.016729
Per-expert cos²(θ) values:
| Expert | cos²(θ) | align |
| --- | --- | --- |
| 0 | 0.059184 | 0.059184 |
| 1 | 0.093662 | 0.093662 |
| 2 | 0.098208 | 0.098208 |
| 3 | 0.102524 | 0.102524 |
| 4 | 0.102472 | 0.102471 |
| 5 | 0.087545 | 0.087545 |
| 6 | 0.085803 | 0.085803 |
| 7 | 0.063891 | 0.063891 |
Detailed Results by K Value
K = 1:
- Mean align: 0.086661
- Mean z-score: 2.51
- Mean effect over random: 0.086417
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.080441
K = 2:
- Mean align: 0.088815
- Mean z-score: 2.72
- Mean effect over random: 0.088327
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.080075
K = 4:
- Mean align: 0.102974
- Mean z-score: 2.54
- Mean effect over random: 0.101998
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.089314
K = 8:
- Mean align: 0.118629
- Mean z-score: 2.45
- Mean effect over random: 0.116676
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.098466
K = 16:
- Mean align: 0.142135
- Mean z-score: 2.69
- Mean effect over random: 0.138229
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.110616
K = 32:
- Mean align: 0.174238
- Mean z-score: 2.68
- Mean effect over random: 0.166426
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.123853
K = 64:
- Mean align: 0.217419
- Mean z-score: 2.43
- Mean effect over random: 0.201794
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.139225
K = 128:
- Mean align: 0.285893
- Mean z-score: 2.63
- Mean effect over random: 0.254643
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.174368
K = 256:
- Mean align: 0.375532
- Mean z-score: 2.38
- Mean effect over random: 0.313032
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.215717
K = 512:
- Mean align: 0.476727
- Mean z-score: 2.65
- Mean effect over random: 0.351727
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.250286
K = 1024:
- Mean align: 0.602968
- Mean z-score: 2.46
- Mean effect over random: 0.352968
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.261166
K = 2048:
- Mean align: 0.763862
- Mean z-score: 2.43
- Mean effect over random: 0.263862
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.206376
K = 4096:
- Mean align: 1.000000
- Mean z-score: 0.00
- Mean effect over random: 0.000000
- Argmax accuracy: 0.125 (random baseline: 0.125)
- Alignment margin: 0.000000
Unique Identification Summary
Interpretation:
- Best unique identification at k=1: Argmax accuracy = 1.000, Margin = 0.080441
- Random baseline: 0.125 (1/8 experts)
- ✓ Excellent unique identification: Alignment is strong enough to uniquely identify experts from weights alone. Router vectors encode expert identity in a highly separable and discriminative manner.
- Positive margin (0.080441): Correct expert has higher alignment than all other experts, enabling unique identification.
Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement than just being above chance.
Inter-Expert Orthogonality Analysis
Orthogonality Statistics:
- Mean off-diagonal similarity: -0.0458
- Mean absolute off-diagonal similarity: 0.4108
- Max off-diagonal similarity: 0.6760
- Min off-diagonal similarity: -0.6596
- Std off-diagonal similarity: 0.4340
Note: Off-diagonal values exclude self-similarity (diagonal). Lower values indicate better orthogonality.
Inter-Expert Orthogonality Heatmap (k=2)
Inter-Expert Orthogonality Comparison Across k Values
Complete Analysis Plots
Comprehensive visualization of all metrics for this run.
Complete Analysis Visualization
See "Plot Explanations" section at the top of this report for detailed information about this plot.
6. Individual Analysis - Run 5
Setup and Configuration
Summary Statistics (averaged across experts)
| k | align | delta_vs_shuffle | z_vs_shuffle | effect_over_random | cos_squared | argmax_accuracy | alignment_margin |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.044943 | 0.037256 | 2.469307 | 0.044699 | 0.044943 | 1.000000 | 0.040451 |
| 2 | 0.284024 | 0.241433 | 2.583244 | 0.283536 | 0.000000 | 1.000000 | 0.267528 |
| 4 | 0.318368 | 0.266162 | 2.487106 | 0.317391 | 0.000000 | 1.000000 | 0.300490 |
| 8 | 0.339890 | 0.288258 | 2.652657 | 0.337937 | 0.000000 | 1.000000 | 0.319115 |
| 16 | 0.362707 | 0.299835 | 2.477195 | 0.358801 | 0.000000 | 1.000000 | 0.337860 |
| 32 | 0.383622 | 0.316346 | 2.569243 | 0.375809 | 0.000000 | 1.000000 | 0.352929 |
| 64 | 0.401119 | 0.323436 | 2.602394 | 0.385494 | 0.000000 | 1.000000 | 0.357489 |
| 128 | 0.421271 | 0.322301 | 2.600011 | 0.390021 | 0.000000 | 1.000000 | 0.355564 |
| 256 | 0.450908 | 0.312535 | 2.582455 | 0.388408 | 0.000000 | 1.000000 | 0.338534 |
| 512 | 0.491796 | 0.282973 | 2.661499 | 0.366796 | 0.000000 | 1.000000 | 0.296593 |
| 1024 | 0.560068 | 0.214405 | 2.610082 | 0.310068 | 0.000000 | 1.000000 | 0.215217 |
| 2048 | 0.684269 | 0.095039 | 2.301127 | 0.184269 | 0.000000 | 1.000000 | 0.082996 |
| 4096 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.125000 | 0.000000 |
Cos²(θ) Alignment (k=1)
Mean cos²(θ): 0.044943
Max cos²(θ): 0.069426
Min cos²(θ): 0.021023
Std cos²(θ): 0.014521
Per-expert cos²(θ) values:
| Expert | cos²(θ) | align |
| --- | --- | --- |
| 0 | 0.049431 | 0.049431 |
| 1 | 0.021023 | 0.021023 |
| 2 | 0.069426 | 0.069426 |
| 3 | 0.041228 | 0.041228 |
| 4 | 0.042395 | 0.042395 |
| 5 | 0.054745 | 0.054745 |
| 6 | 0.032556 | 0.032556 |
| 7 | 0.048737 | 0.048737 |
Detailed Results by K Value
K = 1:
- Mean align: 0.044943
- Mean z-score: 2.47
- Mean effect over random: 0.044699
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.040451
K = 2:
- Mean align: 0.284024
- Mean z-score: 2.58
- Mean effect over random: 0.283536
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.267528
K = 4:
- Mean align: 0.318368
- Mean z-score: 2.49
- Mean effect over random: 0.317391
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.300490
K = 8:
- Mean align: 0.339890
- Mean z-score: 2.65
- Mean effect over random: 0.337937
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.319115
K = 16:
- Mean align: 0.362707
- Mean z-score: 2.48
- Mean effect over random: 0.358801
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.337860
K = 32:
- Mean align: 0.383622
- Mean z-score: 2.57
- Mean effect over random: 0.375809
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.352929
K = 64:
- Mean align: 0.401119
- Mean z-score: 2.60
- Mean effect over random: 0.385494
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.357489
K = 128:
- Mean align: 0.421271
- Mean z-score: 2.60
- Mean effect over random: 0.390021
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.355564
K = 256:
- Mean align: 0.450908
- Mean z-score: 2.58
- Mean effect over random: 0.388408
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.338534
K = 512:
- Mean align: 0.491796
- Mean z-score: 2.66
- Mean effect over random: 0.366796
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.296593
K = 1024:
- Mean align: 0.560068
- Mean z-score: 2.61
- Mean effect over random: 0.310068
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.215217
K = 2048:
- Mean align: 0.684269
- Mean z-score: 2.30
- Mean effect over random: 0.184269
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.082996
K = 4096:
- Mean align: 1.000000
- Mean z-score: 0.00
- Mean effect over random: 0.000000
- Argmax accuracy: 0.125 (random baseline: 0.125)
- Alignment margin: 0.000000
Unique Identification Summary
Interpretation:
- Best unique identification at k=1: Argmax accuracy = 1.000, Margin = 0.040451
- Random baseline: 0.125 (1/8 experts)
- ✓ Excellent unique identification: Alignment is strong enough to uniquely identify experts from weights alone. Router vectors encode expert identity in a highly separable and discriminative manner.
- Positive margin (0.040451): Correct expert has higher alignment than all other experts, enabling unique identification.
Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement than just being above chance.
Unique Identification Analysis
Purpose: This analysis tests whether router-expert alignment is not just non-random, but actually uniquely identifying. While shuffled-baseline tests rule out random router-expert pairing, argmax and margin tests assess whether routing vectors encode expert identity in a separable and discriminative manner.
Key Questions:
- Does the correct expert achieve the maximum alignment? (Argmax accuracy)
- What is the separation between correct and next-best expert? (Alignment margin)
- Is alignment strong enough to uniquely identify experts from weights alone?
Unique Identification Plots: The following panels break this analysis down metric by metric, assessing whether routing vectors encode expert identity in a separable and discriminative manner.
- Argmax Accuracy vs k: Fraction of router vectors where the correct expert achieves the maximum alignment among all experts.
- Formula: For each router vector r_i (assigned to expert i), compute alignment with ALL experts: align(r_i, Expert_j) for all j. Argmax accuracy = (1/n) · Σᵢ [argmax_j align(r_i, Expert_j) == i], where [·] is 1 if true, 0 otherwise.
- Interpretation: Measures whether the correct router-expert pairing can be uniquely identified from the alignment matrix. Value of 1.0 means perfect identification (every router's correct expert has the highest alignment). Value of 1/n_experts (e.g., 0.125 for 8 experts) means random guessing.
- Computation: (1) For each router vector r_i, compute alignment with all experts' principal subspaces, forming a row of the alignment matrix, (2) Find which expert has maximum alignment: argmax_j align(r_i, Expert_j), (3) Check if argmax equals the correct expert i, (4) Average across all routers to get accuracy.
- Range: [1/n_experts, 1.0]. Value of 1.0 indicates perfect unique identification. Value near 1/n_experts indicates alignment is not discriminative enough to identify experts uniquely.
- Comparison to shuffled baseline: While shuffled baseline tests whether alignment is above chance, argmax accuracy tests whether alignment is strong enough to uniquely identify the correct expert from all possible experts.
- Alignment Margin vs k: Mean difference between correct expert's alignment and the next-best expert's alignment.
- Formula: For each router vector r_i: margin_i = align(r_i, Expert_i) - max_{j≠i} align(r_i, Expert_j). Mean margin = (1/n) · Σᵢ margin_i.
- Interpretation: Measures the separation between correct and incorrect expert alignments. Positive margin means correct expert has higher alignment than all others (unique identification possible). Negative margin means another expert has higher alignment (misidentification).
- Computation: (1) For each router r_i, compute alignments with all experts, (2) Get correct alignment: align_correct = align(r_i, Expert_i), (3) Get maximum among other experts: align_max_other = max_{j≠i} align(r_i, Expert_j), (4) Compute margin = align_correct - align_max_other, (5) Average across all routers.
- Range: Can be negative or positive. Positive values indicate correct expert is best (unique identification). Negative values indicate another expert is better (confusion). Larger positive margins indicate stronger discriminative power.
- Relationship to argmax accuracy: A router with margin_i > 0 is identified correctly, so argmax accuracy = 1.0 exactly when every router's margin is positive; any router with margin_i < 0 is misidentified, pulling accuracy below 1.0. The margin also quantifies the strength of separation even when the argmax is already correct.
- Alignment Matrix: Full [n_experts × n_experts] matrix where entry (i, j) is the alignment of router vector i with expert j's principal subspace.
- Formula: Matrix[i, j] = align(r_i, Expert_j) = projection energy of router i onto expert j's top-k singular subspace.
- Interpretation: Shows the complete alignment landscape. Diagonal entries (i, i) are correct pairings. Off-diagonal entries show how well routers align with "wrong" experts. For unique identification, diagonal should be the maximum in each row.
- Computation: For each router-expert pair (i, j), compute alignment using the same projection energy formula as the main analysis.
- What to look for: Strong diagonal pattern (diagonal entries are highest in each row) indicates unique identification. Weak diagonal or strong off-diagonal entries indicate confusion between experts.
- Margin Distribution: Histogram of alignment margins across all router vectors for different k values.
- Interpretation: Shows the distribution of discriminative power. Most routers with positive margins indicate good unique identification. Many routers with negative margins indicate frequent misidentification.
- What to look for: Distribution shifted to the right (positive values) indicates strong unique identification. Distribution centered near zero or shifted left indicates weak or no unique identification.
Interpretation Summary:
- Argmax accuracy near 1.0 + positive margins: Alignment is strong enough to uniquely identify experts from weights alone. Router vectors encode expert identity in a separable and discriminative manner.
- Argmax accuracy above random but < 1.0: Alignment is above chance but not strong enough for perfect unique identification. Some routers may be confused with other experts.
- Argmax accuracy near random baseline: Alignment is not discriminative enough to identify experts uniquely, even though it may be above the shuffled baseline (non-random but not uniquely identifying).
- Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement.
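The alignment-matrix, argmax-accuracy, and margin computations above can be sketched end-to-end in numpy. This is a minimal illustration with random stand-ins for the expert matrices and router vectors; the names `experts`, `routers`, and `topk_basis` are ours, not from the analysis code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d, k = 8, 32, 2

# Illustrative stand-ins for per-expert weight matrices and router vectors.
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
routers = rng.normal(size=(n_experts, d))
routers /= np.linalg.norm(routers, axis=1, keepdims=True)

def topk_basis(W, k):
    """Top-k right singular vectors of W, as columns of a [d, k] matrix."""
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    return Vt[:k].T

bases = [topk_basis(W, k) for W in experts]

# Alignment matrix: entry (i, j) = projection energy of router i on expert j.
A = np.array([[np.sum((B.T @ r) ** 2) for B in bases] for r in routers])

# Argmax accuracy: fraction of rows whose maximum sits on the diagonal.
argmax_acc = float(np.mean(np.argmax(A, axis=1) == np.arange(n_experts)))

# Margin: correct alignment minus the best competing alignment, per router.
idx = np.arange(n_experts)
diag = A[idx, idx]
off = A.copy()
off[idx, idx] = -np.inf
margins = diag - off.max(axis=1)

print("argmax accuracy:", argmax_acc)
print("mean margin:", margins.mean())
```

With random stand-ins the accuracy hovers near the 1/n_experts baseline; with real router and expert weights, the report's values above apply.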
Inter-Expert Orthogonality Analysis
Purpose: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.
Method: For each expert, we extract its top singular vector (principal direction) from SVD. We then compute cosine similarity between all pairs of experts' principal directions.
Interpretation:
- Diagonal pattern (high self-similarity, low off-diagonal): Experts are orthogonal and diverse - good!
- High off-diagonal values: Experts are similar to each other - potential expert collapse or redundancy.
- Block patterns: Groups of experts are similar to each other - partial collapse.
Orthogonality Statistics:
- Mean off-diagonal similarity: 0.2060
- Mean absolute off-diagonal similarity: 0.3981
- Max off-diagonal similarity: 0.7257
- Min off-diagonal similarity: -0.7446
- Std off-diagonal similarity: 0.4268
Note: Off-diagonal values exclude self-similarity (diagonal). Lower values indicate better orthogonality.
Inter-Expert Orthogonality Heatmap (k=2)
Inter-Expert Orthogonality: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.
- Method:
- Step 1 - Extract Principal Directions: For each expert in the layer, extract its top singular vector (v₁ = V[:, 0]) from the SVD decomposition. This vector represents the expert's "main direction" or principal component.
- Step 2 - Compute Similarity Matrix: Calculate the cosine similarity between every pair of experts' top singular vectors. The result is a [n_experts, n_experts] matrix where entry (i, j) is the cosine similarity between expert i and expert j's principal directions.
- Formula: similarity[i, j] = (v₁ᵢ · v₁ⱼ) / (||v₁ᵢ|| · ||v₁ⱼ||) = v₁ᵢ · v₁ⱼ (since vectors are normalized), where v₁ᵢ is the top singular vector of expert i.
- Visualization:
- Heatmap: The similarity matrix is displayed as a heatmap with expert indices on both axes.
- Colormap: Uses 'coolwarm' diverging colormap showing signed cosine similarity: blue (-1 = opposite directions) → white (0 = orthogonal) → red (+1 = same direction). The diagonal always shows 1.0 (experts are perfectly similar to themselves).
- Grid: Grid lines separate cells for better readability.
- Annotations: For small matrices (≤16 experts), similarity values are displayed as text annotations on each cell.
- Interpretation:
- Diagonal pattern (good): If the matrix shows high values (near 1.0) only on the diagonal and low values (near 0.0) off-diagonal, experts are orthogonal and diverse. This is the desired behavior - each expert specializes in a different direction.
- High off-diagonal values (bad): If off-diagonal entries are high (e.g., > 0.5), it indicates that multiple experts have similar principal directions. This suggests "Expert Collapse" - experts are redundant and not utilizing their full capacity.
- Block patterns (partial collapse): If there are blocks of high similarity (e.g., experts 0-3 are similar to each other, experts 4-7 are similar to each other), it indicates partial collapse where groups of experts are redundant.
- Mean off-diagonal similarity: A useful summary statistic. Values near 0 indicate good orthogonality. Values > 0.3 suggest significant redundancy. Values > 0.7 indicate severe expert collapse.
- Statistics Reported:
- Mean off-diagonal similarity: Average cosine similarity between different experts (excludes diagonal). Lower is better.
- Mean absolute off-diagonal similarity: Average absolute value of off-diagonal similarities (ignores sign). Lower is better.
- Max off-diagonal similarity: Maximum similarity between any two different experts. If this is high (> 0.7), at least two experts are very similar.
- Min off-diagonal similarity: Minimum similarity (can be negative if experts point in opposite directions).
- Std off-diagonal similarity: Standard deviation of off-diagonal similarities. Higher values indicate more variability in expert relationships.
- Why This Matters:
- Expert Diversity: MoE models are designed to have diverse experts that specialize in different aspects of the input. If experts collapse, the model is not utilizing its full capacity.
- Efficiency: Redundant experts waste model parameters and computation. Orthogonal experts maximize the model's representational capacity.
- Training Health: Expert collapse can indicate training issues (e.g., insufficient load balancing, poor routing, or optimization problems).
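The two-step method above reduces to a few lines of numpy. This sketch uses illustrative random expert matrices (real weights would be loaded from the model); note that SVD determines each top singular vector only up to sign, which is one reason the mean absolute off-diagonal similarity is reported alongside the signed mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d = 8, 32
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # illustrative

# Step 1: top right singular vector per expert (unit norm by construction).
v1 = np.stack([np.linalg.svd(W)[2][0] for W in experts])       # [n_experts, d]

# Step 2: cosine similarity reduces to plain dot products, since the
# rows of v1 are unit vectors.
S = v1 @ v1.T

mask = ~np.eye(n_experts, dtype=bool)   # exclude self-similarity
print("mean off-diag:", round(S[mask].mean(), 4))
print("mean |off-diag|:", round(np.abs(S[mask]).mean(), 4))
print("max off-diag:", round(S[mask].max(), 4))
```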
Inter-Expert Orthogonality Comparison Across k Values
Purpose: This comparison shows how expert orthogonality changes when considering different numbers of principal directions (k). Higher k values consider more singular vectors, potentially revealing more subtle similarities between experts.
Complete Analysis Plots
Comprehensive visualization of all metrics for this run.
Complete Analysis Visualization
See "Plot Explanations" section at the top of this report for detailed information about this plot.
7. Individual Analysis - Run 6
Setup and Configuration
Summary Statistics (averaged across experts)
| k | align | delta_vs_shuffle | z_vs_shuffle | effect_over_random | cos_squared | argmax_accuracy | alignment_margin |
|---|---|---|---|---|---|---|---|
| 1 | 0.023197 | 0.016761 | 1.491834 | 0.022952 | 0.023197 | 0.500000 | 0.011800 |
| 2 | 0.259938 | 0.214391 | 2.386393 | 0.259449 | 0.000000 | 1.000000 | 0.226243 |
| 4 | 0.286469 | 0.237247 | 2.607388 | 0.285492 | 0.000000 | 1.000000 | 0.248632 |
| 8 | 0.307733 | 0.242406 | 2.331280 | 0.305780 | 0.000000 | 1.000000 | 0.258187 |
| 16 | 0.332480 | 0.256799 | 2.505764 | 0.328573 | 0.000000 | 1.000000 | 0.260972 |
| 32 | 0.367138 | 0.268525 | 2.648007 | 0.359326 | 0.000000 | 1.000000 | 0.268582 |
| 64 | 0.408700 | 0.270629 | 2.431572 | 0.393075 | 0.000000 | 1.000000 | 0.273820 |
| 128 | 0.453305 | 0.279122 | 2.549925 | 0.422055 | 0.000000 | 1.000000 | 0.278289 |
| 256 | 0.491464 | 0.273296 | 2.560864 | 0.428964 | 0.000000 | 1.000000 | 0.268457 |
| 512 | 0.530967 | 0.249536 | 2.353941 | 0.405967 | 0.000000 | 1.000000 | 0.242936 |
| 1024 | 0.591075 | 0.211292 | 2.178432 | 0.341075 | 0.000000 | 1.000000 | 0.195782 |
| 2048 | 0.697854 | 0.151148 | 2.182067 | 0.197854 | 0.000000 | 1.000000 | 0.115009 |
| 4096 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.125000 | 0.000000 |
Cos²(θ) Alignment (k=1)
Mean cos²(θ): 0.023197
Max cos²(θ): 0.063815
Min cos²(θ): 0.000086
Std cos²(θ): 0.025389
Per-expert cos²(θ) values:
| Expert | cos²(θ) | align |
|---|---|---|
| 0 | 0.000086 | 0.000086 |
| 1 | 0.031873 | 0.031873 |
| 2 | 0.006196 | 0.006196 |
| 3 | 0.015753 | 0.015753 |
| 4 | 0.008216 | 0.008216 |
| 5 | 0.001565 | 0.001565 |
| 6 | 0.063815 | 0.063815 |
| 7 | 0.058069 | 0.058069 |
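The equality of the cos²(θ) and align columns at k=1 is an identity, not a coincidence: the k=1 projection energy is exactly the squared cosine between the router vector and the top right singular vector. A tiny numpy check with an illustrative random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 16))        # illustrative expert weight matrix
r = rng.normal(size=16)
r /= np.linalg.norm(r)               # normalized router vector

Vt = np.linalg.svd(W)[2]
v1 = Vt[0]                           # top right singular vector (unit norm)
cos2 = float((v1 @ r) ** 2)          # cos²(θ) between r and v1
align1 = float(np.sum((Vt[:1] @ r) ** 2))  # projection energy, k = 1

print(cos2, align1)                  # the two values agree exactly
```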
Detailed Results by K Value
K = 1:
- Mean align: 0.023197
- Mean z-score: 1.49
- Mean effect over random: 0.022952
- Argmax accuracy: 0.500 (random baseline: 0.125)
- Alignment margin: 0.011800
K = 2:
- Mean align: 0.259938
- Mean z-score: 2.39
- Mean effect over random: 0.259449
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.226243
K = 4:
- Mean align: 0.286469
- Mean z-score: 2.61
- Mean effect over random: 0.285492
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.248632
K = 8:
- Mean align: 0.307733
- Mean z-score: 2.33
- Mean effect over random: 0.305780
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.258187
K = 16:
- Mean align: 0.332480
- Mean z-score: 2.51
- Mean effect over random: 0.328573
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.260972
K = 32:
- Mean align: 0.367138
- Mean z-score: 2.65
- Mean effect over random: 0.359326
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.268582
K = 64:
- Mean align: 0.408700
- Mean z-score: 2.43
- Mean effect over random: 0.393075
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.273820
K = 128:
- Mean align: 0.453305
- Mean z-score: 2.55
- Mean effect over random: 0.422055
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.278289
K = 256:
- Mean align: 0.491464
- Mean z-score: 2.56
- Mean effect over random: 0.428964
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.268457
K = 512:
- Mean align: 0.530967
- Mean z-score: 2.35
- Mean effect over random: 0.405967
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.242936
K = 1024:
- Mean align: 0.591075
- Mean z-score: 2.18
- Mean effect over random: 0.341075
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.195782
K = 2048:
- Mean align: 0.697854
- Mean z-score: 2.18
- Mean effect over random: 0.197854
- Argmax accuracy: 1.000 (random baseline: 0.125)
- Alignment margin: 0.115009
K = 4096:
- Mean align: 1.000000
- Mean z-score: 0.00
- Mean effect over random: 0.000000
- Argmax accuracy: 0.125 (random baseline: 0.125)
- Alignment margin: 0.000000
Unique Identification Summary
Interpretation:
- Best unique identification at k=2: Argmax accuracy = 1.000, Margin = 0.226243
- Random baseline: 0.125 (1/8 experts)
- ✓ Excellent unique identification: Alignment is strong enough to uniquely identify experts from weights alone. Router vectors encode expert identity in a highly separable and discriminative manner.
- Positive margin (0.226243): Correct expert has higher alignment than all other experts, enabling unique identification.
Comparison to shuffled baseline: The shuffled baseline tests whether router-expert pairs are non-random. The argmax and margin tests assess whether this non-random alignment is strong enough to uniquely identify the correct expert from all possible experts, which is a stronger requirement than just being above chance.
See the "Unique Identification Analysis" explanation earlier in this report for the purpose, formulas, and interpretation of these plots.
Inter-Expert Orthogonality Analysis
Purpose: This analysis measures how similar experts are to each other within the same layer, detecting potential "Expert Collapse" where multiple experts learn redundant representations.
Method: For each expert, we extract its top singular vector (principal direction) from SVD. We then compute cosine similarity between all pairs of experts' principal directions.
Interpretation:
- Diagonal pattern (high self-similarity, low off-diagonal): Experts are orthogonal and diverse - good!
- High off-diagonal values: Experts are similar to each other - potential expert collapse or redundancy.
- Block patterns: Groups of experts are similar to each other - partial collapse.
Orthogonality Statistics:
- Mean off-diagonal similarity: 0.1581
- Mean absolute off-diagonal similarity: 0.4320
- Max off-diagonal similarity: 0.6343
- Min off-diagonal similarity: -0.6699
- Std off-diagonal similarity: 0.4580
Note: Off-diagonal values exclude self-similarity (diagonal). Lower values indicate better orthogonality.
Inter-Expert Orthogonality Heatmap (k=2)
See the "Inter-Expert Orthogonality Heatmap" explanation earlier in this report for the method, visualization details, and interpretation of this plot.
Inter-Expert Orthogonality Comparison Across k Values
Purpose: This comparison shows how expert orthogonality changes when considering different numbers of principal directions (k). Higher k values consider more singular vectors, potentially revealing more subtle similarities between experts.
Complete Analysis Plots
Comprehensive visualization of all metrics for this run.
Complete Analysis Visualization
See "Plot Explanations" section at the top of this report for detailed information about this plot.